Abstract

We present a new efficient method for approximate search in electronic
lexica. Given an input string (the pattern) and a similarity threshold, the
algorithm retrieves all entries of the lexicon that are sufficiently similar to
the pattern. Search is organized in subsearches that always start with an
exact partial match where a substring of the input pattern is aligned with
a substring of a lexicon word. Afterwards this partial match is extended
stepwise to larger substrings. For aligning further parts of the pattern
with corresponding parts of lexicon entries, more errors are tolerated at
each subsequent step. To support this alignment order, which may
start at any part of the pattern, the lexicon is represented as a structure
that enables immediate access to any substring of a lexicon word and permits
the extension of such substrings in both directions. Experimental
evaluations of the approximate search procedure are given that show significant
efficiency improvements compared to existing techniques. Since
the technique can be used for large error bounds, it offers interesting
possibilities for approximate search in special collections of "long" strings, such
as phrases, sentences, or book titles.
1 Introduction

The problem of approximate search in large lexica is central for many
applications like spell checking, text and OCR correction [Kuk92, PHH+97], internet
search [CB04, AK05, LH99], computational biology [Gus97], etc. In a common
setup the problem may be formulated as follows: A large set of words/strings 
called the lexicon is given as a static background resource. Given an input string 
(the pattern), the task is to efficiently find all entries of the lexicon where the 
Levenshtein distance between pattern and entry does not exceed a fixed bound 
specified by the user. The Levenshtein distance [Lev66] is often replaced by
related distances. In the literature, the problem has found considerable attention,
e.g. [Ofl96, BYN98, BCP02, MS04].

Classical solutions to the problem [Ofl96] try to align the pattern P with
suitable lexicon words in a strict left-to-right manner, starting at the left border
of the pattern. The lexicon is represented as a trie or deterministic finite-state
automaton, which means that each prefix of a lexicon word is only represented
once and corresponds to a unique path beginning at the start state. During the
search, only prefixes of lexicon words are visited where the distance to a prefix
P' of the pattern does not exceed the given bound b. As a filter mechanism
that checks if these conditions are always met, Ukkonen's method [Ukk85] or
Levenshtein automata [SM02] have been used. The main problem with this
solution is the so-called "wall effect": if we tolerate b errors and start searching
in the lexicon from left to right, then in the first b steps we have to consider all
prefixes of lexicon words. Eventually, only a tiny fraction of these prefixes will
lead to a useful lexicon word, which means that our exhaustive initial search
represents a waste of time.
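
The classical left-to-right search can be sketched as follows. This is a minimal illustration, not the implementation of [Ofl96]: it assumes a nested-dict trie, uses the hypothetical names `build_trie` and `search`, reserves the key "$" as an end-of-word marker (so "$" must not occur in the alphabet), and uses one dynamic programming row per trie node as the filter. The counter of visited nodes makes the wall effect visible: it grows quickly with the bound b.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a lexicon word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def search(trie, pattern, b):
    """Return (matches, visited): lexicon words within Levenshtein distance b
    of pattern, and the number of trie nodes visited during the search."""
    matches = []
    visited = 0

    def walk(node, prefix, row):
        # row[j] = Levenshtein distance between pattern[:j] and prefix
        nonlocal visited
        visited += 1
        if "$" in node and row[-1] <= b:
            matches.append(prefix)
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(pattern) + 1):
                cost = 0 if pattern[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # skip a pattern symbol
                                   row[j] + 1,           # skip the lexicon symbol
                                   row[j - 1] + cost))   # match or substitute
            # Prune: no completion of this prefix can be within the bound.
            if min(new_row) <= b:
                walk(child, prefix + ch, new_row)

    walk(trie, "", list(range(len(pattern) + 1)))
    return sorted(matches), visited


trie = build_trie(["ear", "lead", "real"])
print(search(trie, "dread", 2)[0])
```

With bound 0 the search stops at the root for this pattern, whereas with bound 2 every first letter of the lexicon has to be explored.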

In order to avoid the wall effect, we need to find a way of searching in the 
lexicon such that during the initial alignment steps between pattern and lexicon 
words the number of possible errors is as small as possible. The ability to realize 
such a search is directly related to the way the lexicon is represented. In [MS04]
we used two deterministic finite-state automata as a joint index structure for the
lexicon. The first "forward" automaton represents all lexicon entries as before.
The second "backward" automaton represents all reversed entries of the lexicon.
Given an erroneous input pattern, we distinguished two subcases: (i) most of
the discrepancies between the pattern and the lexicon word are in the first
half of the strings; and (ii) most of the discrepancies are in the second half. We
apply two subsearches. For subsearch (i) we use the forward automaton. During
traversal of the first half of the pattern we tolerate at most b/2 errors. Then
search proceeds by tolerating up to b errors. For subsearch (ii) the traversal is
performed on the reversed automaton and the reversed pattern in a similar way:
in the first half starting from the back only b/2 errors are allowed, afterwards
the traversal to the beginning tolerates b errors. In [MS04] it was shown that
the performance gain compared to the classical solution is enormous and at the
same time no candidate is missed.

In this paper we present a method that can be considered as an extension of
the latter. The new method uses ideas introduced in the context of approximate
search in strings in [WM92, Mye94, BYN99, NBY99, NBY00]. Assume that the
pattern can be aligned with a lexicon word with not more than b errors. Clearly,
if we divide the pattern into b+1 pieces, then at least one piece will exactly
match the corresponding substring of a lexicon word in the answer set. In the
new approach we first find the lexicon substrings that exactly match such a given
piece of the pattern ("good parts first"). Afterwards we continue by extending
this alignment, stepwise attaching new pieces on the left or right side. For the
alignment of new pieces, more errors are tolerated at each step, which guarantees
that eventually up to b errors can occur. Since at later steps the set of interesting
substrings to be extended is already small, the wall effect is avoided and it does not
hurt that we need to tolerate more errors. For this kind of search strategy, a new
representation of the lexicon is needed where we can start traversal at any point
of a word. In our new approach, the lexicon is represented as a symmetric compact
directed acyclic word graph (SCDAWG) [BBH+87, IHS+01], a bidirectional
index structure where we (i) have direct access to every substring of a lexicon
word and (ii) can deterministically extend any such substring both to the left
and to the right to larger substrings of lexicon words. This index structure can
be seen as a part of a longer development of related index structures [BBH+87,
Sto95, Gus97, Bre98, Sto00, Maa00, IHS+01, IHS+05, MMW09] extending work
on suffix tries, suffix trees, and directed acyclic word graphs (DAWGs) [Wei73,
McC76, Ukk95, CS84, BBH+85].

Our experimental results show that the new method is much faster than 
previous methods mentioned above. For small distance bounds it often comes
close to the theoretical limit, which is defined as a (in practice merely
hypothetical) method where precomputed solutions are used as output and no search is
needed. In our evaluation we not only consider "usual" lexica with single-word
entries. The method is especially promising for collections of strings where the
typical length is larger than in the case of conventional single-word lexica. Note
that given a pattern P and an error bound b, long strings in the lexicon have
long parts that can be exactly aligned with parts of P. This explains why even
for large error bounds efficient approximate search is possible. In our tests we
used a large collection of book titles, and a list of 351,008 full sentences from
MEDLINE abstracts as "dictionaries". In both cases, the speedup compared
to previous methods is drastic. Future interesting application scenarios might
include, e.g., approximate search in translation memories, address data, and 
related language databases. 

The paper is structured as follows. We start with some formal preliminaries
in Section 2. In Section 3 we present our method informally using an example.
In Section 4 we give a formal description of the algorithm, assuming that an
appropriate index structure for the lexicon with the above functionality is
available. In Section 5 we describe the symmetric compact directed acyclic word
graph (SCDAWG). Section 6 gives a detailed evaluation of the new method,
comparing search times achieved with other methods. Experiments are based
on various types of lexica; we also look at distinct variants of the Levenshtein
distance. In the Conclusion we comment on possible applications of the new
method in spelling correction and other fields. We also add remarks on the
historical sources for the index structure used in this paper.

2 Technical Preliminaries 

Words over a given finite alphabet $\Sigma$ are denoted $P, U, V, W, \ldots$; symbols $\sigma, \sigma_i$
denote letters of $\Sigma$. The empty word is written $\varepsilon$. If $W = \sigma_1 \cdots \sigma_n$, then
$W^{rev}$ denotes the reversed word $\sigma_n \cdots \sigma_1$. The $i$-th symbol $\sigma_i$ of the word
$W = \sigma_1 \cdots \sigma_n$ is denoted $W_i$. In what follows the terms string and word are
used interchangeably. The length (number of symbols) of a word $W$ is denoted
$|W|$. We write $U \circ V$ or $UV$ for the concatenation of the words $U, V \in \Sigma^*$. A
string $U$ is called a prefix (resp. suffix) of $W \in \Sigma^*$ iff $W$ can be represented
in the form $W = U \circ V$ (resp. $W = V \circ U$) for some $V \in \Sigma^*$. A string $V$ is
a substring of $W \in \Sigma^*$ iff $W$ can be represented in the form $W = U_1 \circ V \circ U_2$
for some $U_1, U_2 \in \Sigma^*$. The set of all strings over $\Sigma$ is denoted $\Sigma^*$, and the set
of the nonempty strings over $\Sigma$ is denoted $\Sigma^+$. By a lexicon or dictionary we
mean a finite nonempty collection $\mathcal{D}$ of words. The set of all substrings (resp.
prefixes, suffixes) of words in $\mathcal{D}$ is denoted $Subs(\mathcal{D})$ (resp. $Pref(\mathcal{D})$, $Suf(\mathcal{D})$).
The set of the reversed words from $\mathcal{D}$ is denoted $\mathcal{D}^{rev}$. The size of the lexicon
$\mathcal{D}$ is $\|\mathcal{D}\| := \sum_{W \in \mathcal{D}} |W|$.
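
The notation above can be made concrete with a few set comprehensions. This is only an illustrative sketch: the function names (`subs`, `pref`, `suf`, `size`, `rev`) are chosen here to mirror the notation and are not taken from the paper, and the sets are built naively.

```python
def pref(lexicon):
    """Pref(D): all prefixes (including the empty word) of words in D."""
    return {w[:i] for w in lexicon for i in range(len(w) + 1)}


def suf(lexicon):
    """Suf(D): all suffixes (including the empty word) of words in D."""
    return {w[i:] for w in lexicon for i in range(len(w) + 1)}


def subs(lexicon):
    """Subs(D): all substrings (including the empty word) of words in D."""
    return {w[i:j] for w in lexicon
            for i in range(len(w) + 1)
            for j in range(i, len(w) + 1)}


def rev(lexicon):
    """D^rev: the set of reversed words of D."""
    return {w[::-1] for w in lexicon}


def size(lexicon):
    """||D||: the sum of the lengths of all words in D."""
    return sum(len(w) for w in lexicon)


D = {"ear", "lead", "real"}
print(size(D))               # 3 + 4 + 4 = 11
print("ea" in subs(D))       # "ea" occurs in all three words
```

Note that reversing commutes with taking substrings: the substrings of the reversed lexicon are exactly the reversed substrings of the lexicon, which is what makes a bidirectional index possible.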

Definition 2.1 A deterministic finite-state automaton is a quintuple

$A = (Q, \Sigma, s, \delta, F)$

where $\Sigma$ is a finite input alphabet, $Q$ is a finite set of states, $s \in Q$ is the start
state, $\delta : Q \times \Sigma \to Q$ is a partial transition function, and $F \subseteq Q$ is the set of
final states.

If $A = (Q, \Sigma, s, \delta, F)$ is a deterministic finite-state automaton, the extended
partial transition function $\delta^*$ is defined as usual: for each $q \in Q$ we have
$\delta^*(q, \varepsilon) = q$. For a string $W\sigma$ ($W \in \Sigma^*$, $\sigma \in \Sigma$), $\delta^*(q, W\sigma)$ is defined iff both
$\delta^*(q, W) = p$ and $\delta(p, \sigma) = r$ are defined. In this case, $\delta^*(q, W\sigma) = r$. We
consider the size of a deterministic finite-state automaton $A$ to be linear in the
number of states $|Q|$ plus the number of transitions $|\{(p, \sigma, q) \mid \delta(p, \sigma) = q\}|$.
Assuming that the size (number of symbols) of the alphabet $\Sigma$ is treated as a
constant, the size of $A$ is $O(|Q|)$.
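
As a small illustration of Definition 2.1, the following sketch stores the partial transition function as a dictionary and computes the extended transition function symbol by symbol. The class name `DFA` and its methods are hypothetical helpers introduced here; an undefined transition is represented by `None`.

```python
class DFA:
    """Deterministic finite-state automaton with a partial transition function."""

    def __init__(self, start, delta, final):
        self.start = start
        self.delta = dict(delta)   # maps (state, symbol) to a state
        self.final = set(final)

    def delta_star(self, q, word):
        """Extended transition function; returns None where it is undefined."""
        for symbol in word:
            q = self.delta.get((q, symbol))
            if q is None:
                return None
        return q

    def accepts(self, word):
        q = self.delta_star(self.start, word)
        return q is not None and q in self.final

    def size(self):
        """Number of states plus number of transitions."""
        states = {self.start} | self.final
        states |= {p for (p, _) in self.delta} | set(self.delta.values())
        return len(states) + len(self.delta)


# A toy automaton accepting exactly the words "ab" and "ac".
A = DFA(start=0, delta={(0, "a"): 1, (1, "b"): 2, (1, "c"): 2}, final={2})
print(A.accepts("ab"), A.accepts("a"), A.size())
```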

Definition 2.2 A generalized deterministic finite-state automaton is a quintuple
$A = (Q, \Sigma, s, \delta, F)$, where $Q$, $\Sigma$, $s$ and $F$ are as above and $\delta : Q \times \Sigma^+ \to Q$
is a partial function with the following property: for each $q \in Q$ and each $\sigma \in \Sigma$
there exists at most one $U \in \Sigma^*$ such that $\delta(q, \sigma U)$ is defined.

A transition $\delta(q, \sigma U) = p$ is called a $\sigma$-transition from $q$. The above condition
then says that for each $q \in Q$ and each $\sigma \in \Sigma$ there exists at most one
$\sigma$-transition. In what follows, $\sigma$-transitions of the above form are often denoted
$q \xrightarrow{\sigma U} p$. Let

$\mathcal{V} = \{(q, V, p) \mid q, p \in Q, A \text{ has a transition } q \xrightarrow{V} p\}$.

The size of the generalized deterministic finite-state automaton $A$ is considered
to be $O(|Q| + \sum_{(q,V,p) \in \mathcal{V}} |V|)$, which is not $O(|Q|)$ in general.

2.1 Suffix tries for lexica 

The following definitions capture possible index structures for search in lexica.
First, we define the trie for a lexicon $\mathcal{D}$ as a tree-shaped deterministic
finite-state automaton. Each state of this automaton represents a unique prefix of
lexicon words. The final states represent complete words. Second, the suffix
trie for $\mathcal{D}$ is defined as the trie of all suffixes in $\mathcal{D}$.

Definition 2.3 Let $\mathcal{D}$ be a lexicon over the alphabet $\Sigma$. The trie for $\mathcal{D}$ is
the deterministic finite-state automaton $Trie(\mathcal{D}) = (Q, \Sigma, q_\varepsilon, \delta, \{q_U \mid U \in \mathcal{D}\})$
where $Q = \{q_U \mid U \in Pref(\mathcal{D})\}$ is a set of states indexed with the prefixes in
$Pref(\mathcal{D})$ and $\delta(q_U, \sigma) = q_{U\sigma}$ for all $U \circ \sigma \in Pref(\mathcal{D})$.
Obviously, the size of $Trie(\mathcal{D})$ is $O(\|\mathcal{D}\|)$. While tries support left-to-right
search for words of the lexicon, the next index structure supports left-to-right
search for substrings of lexicon words.

Definition 2.4 Let $\mathcal{D}$ be as above. The suffix trie for $\mathcal{D}$ is the deterministic
finite-state automaton $STrie(\mathcal{D}) := Trie(Suf(\mathcal{D}))$.

In general, the size of the suffix trie for $\mathcal{D}$ is $O(\|Suf(\mathcal{D})\|)$ and $\|Suf(\mathcal{D})\|$ is
quadratic with respect to $\|\mathcal{D}\|$. For example, for every $n \in \mathbb{N}$ the number of
states in $STrie(\{a^n b^n\})$ is $(n + 1)^2$.
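
The quadratic growth can be checked directly by building the suffix trie naively and counting its states. The sketch below uses a hypothetical helper `build_suffix_trie` with a nested-dict trie; it is meant only to confirm the state count on small inputs, not as a practical construction.

```python
def build_suffix_trie(lexicon):
    """Insert every suffix of every word into a nested-dict trie.
    Returns (root, number_of_states), counting the root state."""
    root = {}
    states = 1
    for word in lexicon:
        for i in range(len(word) + 1):
            node = root
            for ch in word[i:]:
                if ch not in node:
                    node[ch] = {}
                    states += 1
                node = node[ch]
    return root, states


for n in range(1, 6):
    _, count = build_suffix_trie(["a" * n + "b" * n])
    print(n, count, (n + 1) ** 2)
```

Each state corresponds to one distinct substring, and the substrings of the word consisting of n letters a followed by n letters b are exactly the words with i letters a followed by j letters b, where i and j range from 0 to n.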

Bidirectional suffix tries. We now introduce a bidirectional index structure
supporting both left-to-right search and right-to-left search for substrings of
lexicon words. For $U \in \Sigma^*$, $q_U$ is a state in $STrie(\mathcal{D})$ iff $q_{U^{rev}}$ is a
state in $STrie(\mathcal{D}^{rev})$. Hence, following Giegerich and Kurtz [GK97], from the
two suffix tries $STrie(\mathcal{D})$ and $STrie(\mathcal{D}^{rev})$ we obtain one bidirectional index
structure by identifying each pair of states $(q_U, q_{U^{rev}})$ from the two structures.

Definition 2.5 The bidirectional suffix trie for $\mathcal{D}$ is the tuple
$BiSTrie(\mathcal{D}) := (Q, \Sigma, q_\varepsilon, \delta_L, \delta_R, F)$, where $(Q, \Sigma, q_\varepsilon, \delta_R, G') = STrie(\mathcal{D})$,
$F := \{q_U \in Q \mid U \in \mathcal{D}\}$ and $\delta_L : Q \times \Sigma \to Q$ is the partial function such that
$(Q^{rev}, \Sigma, q_\varepsilon, \delta_L^{rev}, G'') = STrie(\mathcal{D}^{rev})$ for $Q^{rev} = \{q_{U^{rev}} \mid q_U \in Q\}$
and $\delta_L^{rev}(q_{U^{rev}}, x) = \delta_L(q_U, x)$.

Example 2.6 The bidirectional suffix trie for $\mathcal{D}$ = {ear, lead, real} is shown
in Figure 1.

As in the case of one-directional structures, the main problem is the size of the
index. In general, the size of $BiSTrie(\mathcal{D})$ is quadratic in the size of $\mathcal{D}$.
The final structure, which will be presented in Section 5, can be considered as
a compacted version of the bidirectional suffix trie.
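
To make the functionality of a bidirectional index tangible, the following sketch emulates it with a plain set of substrings. The class `NaiveBiIndex` and its method names are assumptions made for this illustration; like the bidirectional suffix trie it needs quadratic space, so it only demonstrates the interface (substring test, right extension, left extension) and not an efficient structure.

```python
class NaiveBiIndex:
    """Set-based stand-in for a bidirectional substring index."""

    def __init__(self, lexicon):
        self.words = set(lexicon)
        self.alphabet = sorted({ch for w in self.words for ch in w})
        self.substrings = {w[i:j] for w in self.words
                           for i in range(len(w) + 1)
                           for j in range(i, len(w) + 1)}

    def contains(self, u):
        """Is u a substring of some lexicon word?"""
        return u in self.substrings

    def right(self, u):
        """Letters x such that u + x is again a substring (solid arcs)."""
        return [x for x in self.alphabet if u + x in self.substrings]

    def left(self, u):
        """Letters x such that x + u is again a substring (dashed arcs)."""
        return [x for x in self.alphabet if x + u in self.substrings]

    def is_word(self, u):
        """Does u represent a complete lexicon entry (final state)?"""
        return u in self.words


index = NaiveBiIndex(["ear", "lead", "real"])
print(index.right("ea"))   # extensions of "ea" to the right
print(index.left("ea"))    # extensions of "ea" to the left
```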

Remark 2.7 It is known that a suffix tree for a lexicon $\mathcal{D}$ can be stored in space
$O(\|\mathcal{D}\|)$ and built online in time $O(\|\mathcal{D}\|)$ [Ukk95]. Suffix trees are compacted
variants of suffix tries. In this paper we use compact directed acyclic word
graphs [BBH+87, IHS+05], which are minimized variants of suffix trees.

2.2 Approximate search in lexica and Levenshtein filters

Definition 2.8 The Levenshtein distance between $V, W \in \Sigma^*$, denoted $d_L(V, W)$,
is the minimal number of edit operations needed to transform $V$ into $W$. Edit
operations are the deletion of a symbol, the insertion of a symbol, and the
substitution of a symbol by another symbol in $\Sigma$.

In what follows, $Id := \{(\sigma, \sigma) \mid \sigma \in \Sigma\}$ is considered as a set of identity
operations.

Definition 2.9 A set of generalized weighted operations is a pair $(Op, w)$ where

1. $Op \subseteq \Sigma^* \times \Sigma^*$ is a finite set of operations such that $Id \subseteq Op$,

2. $w : Op \to \mathbb{N}$ assigns to each operation $op \in Op$ a nonnegative integer
weight $w(op)$ such that $w(op) = 0$ iff $op \in Id$.

If $op = (X, Y)$ represents an operation in $Op$, then $l(op)$, the left side of the
operation, is defined as $l(op) = X$ and $r(op)$, the right side of the operation, is
defined as $r(op) = Y$. The width of $op \in Op$ is $|l(op)|$.

Definition 2.10 Let $(Op, w)$ be a set of generalized weighted operations. An
alignment is an arbitrary sequence $\alpha = op_1 op_2 \cdots op_n \in Op^*$ of operations
$op_i \in Op$. The notions of left (right) side and weight are extended to alignments
in a natural way:

$l(\alpha) = l(op_1) l(op_2) \cdots l(op_n)$
$r(\alpha) = r(op_1) r(op_2) \cdots r(op_n)$
$w(\alpha) = \sum_{i=1}^{n} w(op_i)$

Note that Definition 2.10 does not permit overlapping of operations in the
sequence. In our setting, operations that transform the left side into the right side
are applied simultaneously. Formally, each sequence of operations representing
an alignment is a string over the alphabet $Op$.

Definition 2.11 The generalized distance induced by a given set of generalized
weighted operations $(Op, w)$ is the function $d : \Sigma^* \times \Sigma^* \to \mathbb{N} \cup \{\infty\}$ which is
defined as:

$d(V, W) = \min\{w(\alpha) \mid \alpha \in Op^*, l(\alpha) = V \text{ and } r(\alpha) = W\}$.

We say that $\alpha \in Op^*$ is an optimal alignment of $V$ and $W$ iff $l(\alpha) = V$, $r(\alpha) = W$
and $w(\alpha) = d(V, W)$.

Remark 2.12 In terms of Definition 2.11 we can represent the Levenshtein
distance as the distance induced by $(Op_L, w_L)$ where
$Op_L = (\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \setminus \{(\varepsilon, \varepsilon)\}$
and $w_L(op) = 1$ for all $op \notin Id$.

Remark 2.13 Given a set of generalized weighted operations $(Op, w)$, dynamic
programming can be used to efficiently compute $d(V, W)$ for strings $V$ and $W$
[Ukk85, Ver88].
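
For the special case of the Levenshtein distance, the dynamic programming computation can be sketched as follows. This is the textbook row-by-row scheme written for unit weights; the function name `levenshtein` is chosen here, and the sketch does not cover arbitrary generalized weighted operations.

```python
def levenshtein(v, w):
    """Levenshtein distance between v and w via row-wise dynamic programming."""
    # row[j] holds the distance between the current prefix of v and w[:j].
    row = list(range(len(w) + 1))
    for i in range(1, len(v) + 1):
        new_row = [i]
        for j in range(1, len(w) + 1):
            cost = 0 if v[i - 1] == w[j - 1] else 1
            new_row.append(min(row[j] + 1,          # delete v[i-1]
                               new_row[j - 1] + 1,  # insert w[j-1]
                               row[j - 1] + cost))  # identity or substitution
        row = new_row
    return row[-1]


for entry in ["ear", "lead", "real"]:
    print(entry, levenshtein("dread", entry))
```

For the pattern dread, the entries lead and real are at distance 2, whereas ear is at distance 3.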

In this paper, we are interested in solutions for the following algorithmic 
problem ("approximate search in lexica"): 

Let $\mathcal{D}$ be a fixed lexicon, let $d$ denote a given generalized distance between
words. For an input pattern $P \in \Sigma^*$ and a bound $b \in \mathbb{N}$, efficiently find all words
$W \in \mathcal{D}$ such that $d(P, W) \le b$.

Definition 2.14 Let $b \in \mathbb{N}$ denote a given bound. By a Levenshtein filter for
bound $b$ we mean any algorithm that takes as input two words $P, U \in \Sigma^*$ and
decides

1. if there exists a string $V \in \Sigma^*$ such that $d_L(P, U \circ V) \le b$,

2. if $d_L(P, U) \le b$.

More generally, if $d$ is any generalized distance, a filter for $d$ for bound $b$ is an
algorithm that takes as input two words $P, U \in \Sigma^*$ and decides

1. if there exists a string $V \in \Sigma^*$ such that $d(P, U \circ V) \le b$,

2. if $d(P, U) \le b$.

Note that a filter for $d$ for bound $b$ does not depend on the lexicon $\mathcal{D}$.
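
A simple matrix-based Levenshtein filter can be sketched as follows. The sketch relies on the standard observation that some extension of U stays within the bound exactly when some prefix of P is within the bound of U, which is the minimum of the last dynamic programming row. The function names `last_row`, `can_extend` and `within_bound` are hypothetical; a Levenshtein automaton would answer the same two questions more efficiently.

```python
def last_row(pattern, u):
    """Return the row r with r[j] = Levenshtein distance of pattern[:j] and u."""
    row = list(range(len(pattern) + 1))
    for ch in u:
        new_row = [row[0] + 1]
        for j in range(1, len(pattern) + 1):
            cost = 0 if pattern[j - 1] == ch else 1
            new_row.append(min(new_row[j - 1] + 1,
                               row[j] + 1,
                               row[j - 1] + cost))
        row = new_row
    return row


def can_extend(pattern, u, b):
    """Decision 1: is there a string V with d_L(pattern, u + V) at most b?"""
    return min(last_row(pattern, u)) <= b


def within_bound(pattern, u, b):
    """Decision 2: is d_L(pattern, u) at most b?"""
    return last_row(pattern, u)[-1] <= b


print(can_extend("dread", "le", 2), within_bound("dread", "lead", 2))
print(can_extend("dread", "ear", 2), within_bound("dread", "ear", 2))
```

The second line illustrates that the two decisions differ: the prefix ear could still be completed to a string within distance 2 of dread, although ear itself is too far away.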

The interest in filters of the above form relies on the observation that in
approximate search in lexica we often face a given input pattern $P \in \Sigma^*$. When
we traverse the lexicon, which is represented as a trie or automaton, we want
to recognize at the earliest possible point if the current path, which represents
a prefix $U$ of a lexicon word, cannot be completed to any word that is close
enough to $P$ (Decision Problem 1). When reaching a final state representing a
word $W = U$ of the lexicon we want to check if $W$ satisfies the bound (Decision
Problem 2). In [Ofl96], the matrix-based dynamic programming approach was
used to realize a Levenshtein filter. In [SM02] we introduced the concept of a
Levenshtein automaton, which represents a more efficient filter mechanism.

In what follows we make a more general use of filters. Our lexicon traversal 
below starts from a substring of a lexicon word, which is compared to a substring 
P of the pattern. In addition to steps where we extend substrings on the right 
using a filter of the above form, we also use steps where we extend substrings 
with new symbols on the left. In this situation we need to check for given
$P, U \in \Sigma^*$ if there exists a string $V$ such that $d(P, V \circ U) \le b$. This means that
with suitable extensions of U on the left we might reach an interesting alignment 
partner for P among the substrings of lexicon words. 

Remark 2.15 Assume that we have an algorithm that, given a distance $d$
induced by $(Op, w)$ and a bound $b$, constructs a filter for extension steps on
the right of the above form. We may build a second filter for the symmetric
distance $d^{rev} := (\{op^{rev} \mid op \in Op\}, w^{rev})$ where $op^{rev} := (l(op)^{rev}, r(op)^{rev})$
and $w^{rev}(op^{rev}) := w(op)$ for $op \in Op$. Obviously, for given $P, U \in \Sigma^*$ there
exists a string $V$ such that $d(P, V \circ U) \le b$ iff there exists a string $V'$ such that
$d^{rev}(P^{rev}, U^{rev} \circ V') \le b$. Hence the second "reversed filter" can be used to
control extension steps on the left.
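
For the Levenshtein distance the set of operations is closed under reversal, so the reversed filter is simply the ordinary filter applied to the reversed strings. The following sketch illustrates this under that assumption; `can_extend_right` and `can_extend_left` are names introduced here for illustration.

```python
def can_extend_right(pattern, u, b):
    """Is there a string V with d_L(pattern, u + V) at most b?"""
    row = list(range(len(pattern) + 1))
    for ch in u:
        new_row = [row[0] + 1]
        for j in range(1, len(pattern) + 1):
            cost = 0 if pattern[j - 1] == ch else 1
            new_row.append(min(new_row[j - 1] + 1,
                               row[j] + 1,
                               row[j - 1] + cost))
        row = new_row
    # Some prefix of the pattern must be within the bound of u.
    return min(row) <= b


def can_extend_left(pattern, u, b):
    """Is there a string V with d_L(pattern, V + u) at most b?
    Reduces to the right-extension test on the reversed strings."""
    return can_extend_right(pattern[::-1], u[::-1], b)


print(can_extend_left("dread", "ad", 2))   # "ad" is a suffix of the pattern
print(can_extend_left("dread", "re", 1))   # no suffix of "dread" is within 1 of "re"
```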

The use of filters is directly related to the "wall effect". When the lexicon 
offers many possibilities for extending a given prefix or substring of a lexicon 
word, then the search space in a crucial way depends on the bound b of the filter 
that is used. When using a large bound, a large number of extensions has to 
be considered. Note that typically short prefixes/substrings have a very large 
number of extensions in the lexicon, while long prefixes/substrings often point 
to a unique entry. From this perspective, the problem discussed in the paper 
can be rephrased: we are interested in a search strategy where the use of large
bounds in filters is only necessary for large substrings at the end of the search.
When we construct alignments between the pattern and lexicon words, we want
to build "good parts" first.

3 Basic Idea 

In this section we explain the idea of our algorithm using a small example. We
also characterize the kind of resources needed to achieve its efficient
implementation. Consider the dictionary

$\mathcal{D}$ = {ear, real, lead}.

Suppose that for the pattern 

P = dread 

we want to find all words $W$ in $\mathcal{D}$ such that $d_L(P, W) \le 2$. The standard way to
solve the problem is a left-to-right search in the lexicon, using a filter for bound 
2. As described above, we want to avoid the use of a large filter bound at the 
beginning of the search. We next illustrate a first approach along these lines, 
which is then refined. 

Let $W$ in $\mathcal{D}$ be such that $d_L(P, W) \le 2$. When we split $P$ = dread into the
three parts d, re, ad, then there must be a corresponding representation of $W$ in
the form $W = W_1 \circ W_2 \circ W_3$ such that $d_L(\text{d}, W_1) + d_L(\text{re}, W_2) + d_L(\text{ad}, W_3) \le 2$.
We distinguish three cases: $d_L(\text{d}, W_1) = 0$, $d_L(\text{re}, W_2) = 0$, or $d_L(\text{ad}, W_3) = 0$.
This leads to the following three subtasks:

1. Check if d represents a substring of a word in $\mathcal{D}$. In the positive case, look
for extensions $V$ of d on the right to words of the form $\text{d} \circ V \in \mathcal{D}$ such
that $d_L(\text{dread}, \text{d}V) \le 2$.

2. Check if re represents a substring of a word in $\mathcal{D}$. In the positive case,
look for extensions $V_2$ of re on the right and extensions $V_1$ of $\text{re}V_2$ on the
left to words of the form $V_1 \circ \text{re} \circ V_2 \in \mathcal{D}$ such that $d_L(\text{dread}, V_1 \text{re} V_2) \le 2$.

3. Check if ad represents a substring of a word in $\mathcal{D}$. In the positive case,
look for extensions $V$ of ad on the left to words of the form $V \circ \text{ad} \in \mathcal{D}$
such that $d_L(\text{dread}, V\text{ad}) \le 2$.

The above task can be solved using an appropriate bidirectional index
structure. As an illustration^1 we use the bidirectional suffix trie (cf. Def. 2.5) for
$\mathcal{D}$ = {ear, lead, real}, which is shown in Figure 1. The nodes $q_U$ of the graph
depicted correspond to the substrings $U$ of our lexicon $\mathcal{D}$; nodes marked with
a double ellipse represent words in $\mathcal{D}$. Following the solid arcs we extend the
current substring to the right. Starting from $q_\varepsilon$ and traversing solid arcs we find
any substring. If we follow the dashed arcs we extend the current substring to
the left.

^1 We should stress that $BiSTrie(\mathcal{D})$ is just used for illustration purposes. In general, the
size of $BiSTrie(\mathcal{D})$ is quadratic in the size of $\mathcal{D}$, which means that a more condensed
structure is needed in practice.

Figure 1: The bidirectional suffix trie $BiSTrie(\mathcal{D})$ for $\mathcal{D}$ = {ear, lead, real}
represents all substrings of $\mathcal{D}$ and allows extending each substring either to the
right by following the solid arcs or to the left by following the dashed arcs.

It should be obvious how we may use the graph to solve the three subtasks
in our example mentioned above. As an example, we consider Subtask 2. Using
the index we see that re is a substring of a word in $\mathcal{D}$. Right extension steps of
re in the index are controlled using a Levenshtein filter for pattern suffix read
and bound 2. We find the two extensions rea and real. Then, for the left
extension steps we use the filter for the full pattern dread and bound 2. The
index shows that both rea and real cannot be extended on the left. However,
since already $d_L(\text{dread}, \text{rea}) \le 2$ and $d_L(\text{dread}, \text{real}) \le 2$, the filter licenses
the empty left extension. Among the two resulting substrings, real $\in \mathcal{D}$ is a
solution. In a similar way, solving Subtask 3 leads to the second solution lead.
When we abstract from our small example, the above procedure gives rise to
the following

First search idea. Split $P = P_1 \circ \cdots \circ P_{b+1}$ into $b+1$ parts $P_i$ of
approximately the same length and apply $b+1$ subsearches. For the $i$-th subsearch,
first check if $P_i$ is a substring of a lexicon word (Step 1). In the positive case,
try to extend $P_i$ to larger substrings of lexicon words, using a Levenshtein filter
for bound $b$ (Step 2).
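
Step 1 of this idea can be sketched in a few lines. The splitting rule below (longer pieces placed at the end) is an assumption chosen so that dread with bound 2 yields the pieces d, re, ad used in the running example; the text only requires pieces of approximately equal length. The helper names `split_pattern` and `exact_pieces` are hypothetical, and if the pattern is shorter than b+1 some pieces are empty.

```python
def split_pattern(pattern, b):
    """Split pattern into b+1 consecutive pieces whose lengths differ by at most 1."""
    k = b + 1
    base, extra = divmod(len(pattern), k)
    sizes = [base] * (k - extra) + [base + 1] * extra
    pieces, pos = [], 0
    for s in sizes:
        pieces.append(pattern[pos:pos + s])
        pos += s
    return pieces


def exact_pieces(pattern, b, word):
    """Pieces of the pattern that occur as exact substrings of word.
    If word is within distance b of pattern, this list cannot be empty."""
    return [piece for piece in split_pattern(pattern, b) if piece in word]


print(split_pattern("dread", 2))
for entry in ["ear", "lead", "real"]:
    print(entry, exact_pieces("dread", 2, entry))
```

Mere occurrence of a piece is only a necessary condition for a match; Step 2 then has to verify each candidate by extending the piece under the control of a filter.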

A nice aspect of this search strategy is that each subsearch starts with an
exact match (Step 1), which represents a search with filter bound 0. However,
afterwards in Step 2 we immediately use a Levenshtein filter for the full bound
b for all left and right extension steps. If b is large, this may lead to a large
search space.

Figure 2: Reducing the original query (dread, 2) into simpler ones. As a result
we obtain an ordered binary tree representing search alternatives. The labels of
the arcs indicate what sort of filter has to be used at the extension steps. The
label $\mathcal{F}$ shows that we extend to the right and thus an ordinary filter is required,
whereas the label $\mathcal{F}^{rev}$ means that we extend to the left and therefore a reverse
filter (see Remark 2.15) has to supervise this step. The bound that determines
a filter coincides with the threshold of the query written in the parent node.

Improved search idea. We now look for a refinement where we can
use small filter bounds for the initial extension steps. To this end, we first
slightly generalize the problem and search all substrings $V$ of words in $\mathcal{D}$ such
that $d_L(P, V) \le b$. Afterwards we simply filter those substrings that represent
entries of $\mathcal{D}$.

We illustrate the improved search procedure using again our small example.
In what follows, the notation (dread, 2) is used as a shorthand for the
algorithmic task to find all substrings $V \in Subs(\mathcal{D})$ such that $d_L(\text{dread}, V) \le 2$,
and similarly for other strings and bounds. The expression (dread, 2) is called
a query with query pattern dread and bound 2. Now consider the query tree
depicted in Figure 2. The idea is to solve the problems labeling the nodes in a
bottom-up manner. The three leaves exactly correspond to the Steps 1 in the
three subtasks discussed above: in fact, to solve the problems (d, 0), (re, 0) and
(ad, 0) just means to check if d, re, or ad are substrings of lexicon words. We
then solve problem (dre, 1). This involves two independent steps.

1. We look for extensions of the substring d (as a solution of the left child in
the tree) at the right.

2. We look for extensions of the substring re (as a solution of the right child
in the tree) at the left.

It is important to note that both extension steps are controlled using a
Levenshtein filter for bound 1 for $P'$ = dre (see Figure 2). As a result we obtain
the single solution re for the query (dre, 1). The next step in the bottom-up
procedure looks at the root node (dread, 2). Solving this node again involves
two independent steps.

1. We look for extensions of the substring re (as a solution of the left child
in the tree) at the right.

2. We look for extensions of the substring ad (as a solution of the right child
in the tree) at the left.

At this final step we cannot avoid the use of a Levenshtein filter for dread and
bound 2. We respectively obtain (1) rea, real and (2) ead, lead.

Comparing the two search strategies, we see that at least Subtasks 1 and 2 
have been replaced by a subsearch where we use filter bound 2 only at the last 
extension step where we already found re and want to solve (dread, 2) adding 
right extensions. More generally, search trees of this form offer a possibility to 
postpone the use of large filter bounds to the end of the search. Details will be 
given in the next section where we formally describe the refined procedure. 

Remark 3.1 In order to efficiently realize a bottom-up subsearch of the form
indicated above we need

1. an index structure that supports the following tasks:

(a) given a string $V$, efficiently decide if $V$ represents a substring of a
lexicon word,

(b) given a substring $V$ of a lexicon word, give immediate access to all
substrings of lexicon words of the form $V \circ \sigma$ that add one letter
$\sigma \in \Sigma$ to the right,

(c) given a substring $V$ of a lexicon word, give immediate access to all
substrings of lexicon words of the form $\sigma \circ V$ that add one letter
$\sigma \in \Sigma$ to the left.

2. A filter for the bound $b$ specified at the parent node faced at an upward
step. The filter takes as first input the query pattern $P'$ specified at the
parent node. Subsearches start with a given solution of the left (right)
child query. When adding letters to the right (left) we use a conventional
("reversed") filter, cf. Remark 2.15.

Remark 3.2 A similar idea was introduced by Navarro and Baeza-Yates [NBY00]
for approximate search of a pattern in the set of substrings of a long text. In
[NBY00] the authors use suffix arrays for its realization and analyze how to
organize the splitting to optimize the efficiency of this approach in terms of the
length of the text and pattern. Their theoretical results show that this
technique improves over the naive algorithm in some cases, but still it does not avoid
the wall effect in general. In [NBY99] the same authors present an algorithm
for online approximate search of substrings of a long text. Their algorithm, like
the algorithm presented here, uses binary trees representing search alternatives
to reduce the search space. The essential difference is that their algorithm is
online, i.e. does not rely on a precomputed index.
4 Search procedure

The purpose of this section is to provide a formal description of the approach
considered in the previous section. In what follows we assume that $\mathcal{D}$ is a fixed
lexicon and $d$ is a given generalized distance induced by $(Op, w)$. For input
strings $P \in \Sigma^*$ and a bound $b \in \mathbb{N}$ we want to retrieve all words $W \in \mathcal{D}$ such
that $d(P, W) \le b$. We consider the case where each operation $op \in Op$ has
width $\le 1$ (cf. Def. 2.9). In the Appendix we show how essentially the same
technique can be used for arbitrary generalized distances.

Definition 4.1 A query is a pair $(P', b')$ where $P' \in \Sigma^*$ is a substring of $P$
and $b' \le b$. The set $Sol_{\mathcal{D}}(P', b') := \{V \in Subs(\mathcal{D}) \mid d(P', V) \le b'\}$ is called the
solution set for $(P', b')$.

The search procedure has three phases. We first build a search tree for the
query $(P, b)$. Then, using a bottom-up procedure we solve all queries of the
search tree, in particular $(P, b)$. The final step is trivial. We simply select from
$Sol_{\mathcal{D}}(P, b)$ those elements that represent entries of $\mathcal{D}$.

4.1 Building the search tree for a pattern

We explain how to obtain for a given pattern $P$ a binary tree $\mathcal{T}_P$ with queries
assigned to each node, see Figure 2.

Select any rooted ordered tree $T$ with $b+1$ leaves $\lambda_1, \ldots, \lambda_{b+1}$ (enumerated in
canonical left-to-right ordering) where each non-leaf node has exactly 2 children.
Then decorate the nodes of $T$ with queries to define the search tree $\mathcal{T}_P$: Split
the pattern $P$ in the form $P = P_1 \circ P_2 \circ \cdots \circ P_{b+1}$ where the $P_i$ are substrings of
$P$ of almost equal length, i.e. $||P_i| - |P_j|| \le 1$ ($1 \le i, j \le b+1$). To each leaf
$\lambda_i$ of $T$ assign the query $(P_i, 0)$. To each non-leaf node $\eta$ of $T$ assign the query
$(P_i \circ \cdots \circ P_{i+b'}, b')$ where $\lambda_i, \ldots, \lambda_{i+b'}$ is the sequence of leaves representing
descendants of $\eta$ in $T$ in the natural left-to-right ordering. (Note that the root
of $\mathcal{T}_P$ has label $(P, b)$, which is the original query.)
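
The decoration of a tree with queries can be sketched as a short recursive function. The sketch assumes that the pattern has already been split into pieces and chooses one particular tree shape, namely a balanced one in which the left subtree receives the extra leaf; any other binary tree over the same leaves would be equally admissible. The tuple layout (pattern, bound, left child, right child) and the name `build_tree` are assumptions of this illustration.

```python
def build_tree(pieces):
    """Build a balanced query tree over the given pieces.
    Each node is a tuple (query_pattern, bound, left_child, right_child);
    a node covering k consecutive pieces carries the bound k - 1."""
    if len(pieces) == 1:
        return (pieces[0], 0, None, None)
    mid = (len(pieces) + 1) // 2          # left subtree gets the extra leaf
    left = build_tree(pieces[:mid])
    right = build_tree(pieces[mid:])
    return ("".join(pieces), len(pieces) - 1, left, right)


def show(node, depth=0):
    """Print the queries of the tree, indented by depth."""
    pattern, bound, left, right = node
    print("  " * depth + f"({pattern}, {bound})")
    if left is not None:
        show(left, depth + 1)
        show(right, depth + 1)


show(build_tree(["d", "re", "ad"]))
```

For the pieces d, re, ad this yields the root (dread, 2) with children (dre, 1) and (ad, 0), where (dre, 1) in turn has the leaves (d, 0) and (re, 0).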

Example 4.2 In the example considered in Section 3 we had $b = 2$, $P$ = dread,
$P_1$ = d, $P_2$ = re, $P_3$ = ad. As our starting point $T$ for decoration, we selected
one among two possible binary rooted ordered trees.

Remark 4.3 The choice of a tree $T$ satisfying the above conditions influences
the time needed to solve the query. The general philosophy is to avoid queries
of the form $(P', b')$ where $P'$ is a short word and $b'$ is a large bound. A good
choice is the use of a balanced tree where all paths from the root reach a certain
length. Other optimizations represent a possible subject for future studies.

4.2 Computation of solution sets 

For each query $(P', b')$ of the tree $\mathcal{T}_P$ we compute a set $S_{\mathcal{D}}(P', b')$ in a bottom-up
fashion. We shall prove below that $S_{\mathcal{D}}(P', b')$ is the solution set $Sol_{\mathcal{D}}(P', b')$
in each case.

Initialization steps. For a leaf query $(P_i, 0)$ we decide if $P_i$ is a substring
of a lexicon word. In the positive case we let $S_{\mathcal{D}}(P_i, 0) := \{P_i\}$, otherwise we
define $S_{\mathcal{D}}(P_i, 0) := \emptyset$.

Definition 4.4 Extension steps. Let $(P', b')$ denote the query at a non-leaf
node $\eta$ of $\mathcal{T}_P$, let $(P'_1, b'_1)$ and $(P'_2, b'_2)$ denote the queries of the two children
$\eta_1, \eta_2$ of $\eta$, which are given in the natural left-to-right ordering. Given the sets
$S_{\mathcal{D}}(P'_1, b'_1)$ and $S_{\mathcal{D}}(P'_2, b'_2)$ we define $S_{\mathcal{D}}(P', b')$ as the union of the two sets $S_1$
and $S_2$ defined as

$S_1 := \{U \circ V \in Subs(\mathcal{D}) \mid U \in S_{\mathcal{D}}(P'_1, b'_1),\ d(P'_1, U) + d(P'_2, V) \le b'\}$,

$S_2 := \{V \circ U \in Subs(\mathcal{D}) \mid U \in S_{\mathcal{D}}(P'_2, b'_2),\ d(P'_1, V) + d(P'_2, U) \le b'\}$.

Proposition 4.5 The computation of solution sets is correct: for each query
$(P', b')$ of $\mathcal{T}_P$ we have $S_{\mathcal{D}}(P', b') = Sol_{\mathcal{D}}(P', b')$.
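The initialization and extension steps can be checked on a small instance with a brute-force sketch. It is not the index-based procedure of the paper: it enumerates all substrings explicitly, uses the standard Levenshtein distance, and reuses the lexicon of Example 5.17 without the markers # and $. The names `lev`, `substrings` and `solve`, and the encoding of the tree as nested pairs of leaf strings, are assumptions made for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def lev(a: str, b: str) -> int:
    """Standard Levenshtein distance (every operation has width <= 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def substrings(lexicon):
    """Subs(D): all substrings of lexicon words, including the empty string."""
    return {w[i:j] for w in lexicon
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def solve(tree, subs):
    """Return (P', b', S_D(P', b')) for a tree given as nested pairs of leaf strings."""
    if isinstance(tree, str):                      # leaf query (P_i, 0)
        return tree, 0, ({tree} if tree in subs else set())
    p1, b1, s_left = solve(tree[0], subs)
    p2, b2, s_right = solve(tree[1], subs)
    bound = b1 + b2 + 1                            # children bounds sum to b' - 1
    result = set()
    for w in subs:
        for k in range(len(w) + 1):
            u, v = w[:k], w[k:]
            # S_1: the left part is a known candidate; S_2: the right part is.
            if (u in s_left or v in s_right) and lev(p1, u) + lev(p2, v) <= bound:
                result.add(w)
                break
    return p1 + p2, bound, result
```

On the lexicon {ear, lead, real} with the tree ("d", ("re", "ad")) for the pattern dread, the computed set coincides with the set of all substrings within distance 2 of the pattern, and the final selection step keeps the entries lead and real.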

4.3 Correctness proof and remarks 

To prove Proposition 4.5, some preparations are needed.

Remark 4.6 Let $\eta$ denote a non-leaf node of $\mathcal{T}_P$ decorated with query $(P', b')$.
Let $(P'_1, b'_1)$ and $(P'_2, b'_2)$ denote the queries of the two children $\eta_1, \eta_2$ of $\eta$, which
are given in the natural left-to-right ordering. Then we have $P' = P'_1 \circ P'_2$ and
$b'_1 + b'_2 = b' - 1$.

Proposition 4.7 Let $\mathcal{D}$ and $d = (Op, w)$ be as above, assume that each operation
in $Op$ has width $\le 1$, and let $b \in \mathbb{N}$. If $P \in \Sigma^*$ is a word with $P = P_1 \circ P_2$ and
$\alpha \in Op^*$ is an alignment with $l(\alpha) = P$ and weight $w(\alpha) \le b$, then

1. $\alpha$ can be represented in the form $\alpha = \alpha_1 \circ \alpha_2$ such that $l(\alpha_1) = P_1$ and
$l(\alpha_2) = P_2$.

2. if $b'$ and $b''$ are two arbitrary integers with the property $b' + b'' = b - 1$,
then $w(\alpha_1) \le b'$ or $w(\alpha_2) \le b''$.

Proof. Since $\alpha$ is a sequence of operations $\alpha = op_1 \ldots op_n$ and $P =
l(op_1) \ldots l(op_n)$, the first part follows immediately from the fact that $|l(op_i)| \le 1$
($1 \le i \le n$). The second statement is an obvious consequence. □
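The second statement is a pigeonhole argument. The derivation below spells it out under two assumptions that are not stated explicitly at this point: the weight of an alignment is the sum of the weights of its parts, and all weights are integers (as for the unit-weight Levenshtein distance).

```latex
\begin{align*}
w(\alpha_1) > b' \ \text{and}\ w(\alpha_2) > b''
  &\implies w(\alpha_1) \ge b' + 1 \ \text{and}\ w(\alpha_2) \ge b'' + 1 \\
  &\implies w(\alpha) = w(\alpha_1) + w(\alpha_2) \ge b' + b'' + 2 = b + 1 > b .
\end{align*}
```

This contradicts $w(\alpha) \le b$, so at least one of the two parts stays within its bound.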

Corollary 4.8 If $P$ and $W$ are arbitrary words with $d(P, W) \le b$ and $P =
P_1 \circ P_2$, then:

1. $W$ can be represented in the form $W = W_1 \circ W_2$ such that $d(P, W) =
d(P_1, W_1) + d(P_2, W_2)$.

2. if $b' + b'' = b - 1$, then $d(P_1, W_1) \le b'$ or $d(P_2, W_2) \le b''$.

Proof. Let $\alpha$ be an optimal alignment of $P$ and $W$. Then $w(\alpha) = d(P, W) \le
b$. We can define $W_i = r(\alpha_i)$ for $i = 1, 2$ where $\alpha_i$ are the alignments provided
by Proposition 4.7. The second statement follows since $w(\alpha_i) \ge d(P_i, W_i)$, by
the definition of a $d$-distance. □

(Proof of Proposition 4.5.) This is obvious for the leaf queries. Consider
a non-leaf node $\eta$ of $\mathcal{T}_P$ with query $(P', b')$, let $(P'_1, b'_1)$ and $(P'_2, b'_2)$ denote the
queries of the two children $\eta_1, \eta_2$ of $\eta$, which are given in the natural left-to-right
ordering. We may assume that $S_{\mathcal{D}}(P'_i, b'_i) = Sol_{\mathcal{D}}(P'_i, b'_i)$ for $i = 1, 2$.
Remark 4.6 shows that $P' = P'_1 \circ P'_2$. Consider an element $U \circ V$ of $S_1$ (we
use the notation introduced in Section 4.2). We have $d(P', U \circ V) = d(P'_1 \circ
P'_2, U \circ V) \le d(P'_1, U) + d(P'_2, V) \le b'$, which shows that $U \circ V \in Sol_{\mathcal{D}}(P', b')$.
Hence $S_1 \subseteq Sol_{\mathcal{D}}(P', b')$. Similarly we see that $S_2 \subseteq Sol_{\mathcal{D}}(P', b')$. Conversely
consider an element $W \in Sol_{\mathcal{D}}(P', b')$. Since $P' = P'_1 \circ P'_2$, Corollary 4.8 shows
that $W$ can be represented as $W = W_1 \circ W_2$ such that $d(P'_1, W_1) + d(P'_2, W_2) =
d(P', W)$ and we have (1) $d(P'_1, W_1) \le b'_1$ or (2) $d(P'_2, W_2) \le b'_2$. In case ($i$),
$W_i \in S_{\mathcal{D}}(P'_i, b'_i) = Sol_{\mathcal{D}}(P'_i, b'_i)$, which shows that $W = W_1 \circ W_2$ is found in $S_i$
($i = 1, 2$). Hence $Sol_{\mathcal{D}}(P', b') \subseteq S_1 \cup S_2 = S_{\mathcal{D}}(P', b')$. □

Remark 4.9 It is simple to see that in the example presented in the previous
section the computation of solution sets follows exactly the above procedure.
Following the definition of the Extension steps, we need to construct
the complete sets of candidates $S_{\mathcal{D}}(P'_1, b'_1)$ and $S_{\mathcal{D}}(P'_2, b'_2)$ in order to compute
the sets $S_1$ and $S_2$ for the query node $(P', b')$ and eventually determine
$S_{\mathcal{D}}(P', b') = S_1 \cup S_2$. This corresponds to a bottom-up traversal of the search
tree $\mathcal{T}_P$.

Remark 4.10 Observe that given a candidate $U_1 \in S_{\mathcal{D}}(P'_1, b'_1)$, the set of successful
candidates $U_1 \circ V_1 \in S_1$ with $d(U_1, P'_1) + d(V_1, P'_2) \le b'$ which result from the
right extension steps depends only on the specific candidate $U_1$, the dictionary
$\mathcal{D}$ and the query node $(P', b')$, but not on $S_{\mathcal{D}}(P'_1, b'_1)$. A similar observation is
valid for the successful candidates $U_2 \in S_{\mathcal{D}}(P'_2, b'_2)$.

Remark 4.11 Remark 4.10 means that the target set $S_{\mathcal{D}}(P, b)$ can be constructed
by using any traversal algorithm $A$ of the search tree $\mathcal{T}_P$ which
satisfies the following three conditions:

1. it correctly initializes the candidate sets $S_{\mathcal{D}}(P_i, 0)$ for each leaf $(P_i, 0)$.

2. if $A$ generates a candidate $U_1$ in a node $(P'_1, b'_1)$ which is a left child of
$(P', b')$, then $A$ generates also all successful candidates $U_1 \circ V_1$ for the node
$(P', b')$ such that $d(U_1, P'_1) + d(V_1, P'_2) \le b'$.

3. if $A$ generates a candidate $U_2$ in a node $(P'_2, b'_2)$ which is a right child of
$(P', b')$, then $A$ generates also all successful candidates $V_2 \circ U_2$ for the node
$(P', b')$ such that $d(V_2, P'_1) + d(U_2, P'_2) \le b'$.

In particular one can replace the bottom-up traversal by a depth-first search
algorithm.
Remark 4.12 It is easy to see that an efficient realization of the above
search algorithm can be based on the resources described in Remark 3.1. The
efficient computation of the sets $S_1$ and $S_2$ in the bottom-up steps is achieved by
using the given index structure for extensions on the right and left, respectively.
Each extension by a single letter is controlled using a filter for the generalized
distance $d$ for the appropriate bound. As we mentioned earlier, the index structure
shown in Figure 1 only serves for illustration purposes. When using this
construction there is a one-to-one correspondence between the nodes and the
substrings of lexicon words. In general, the number of substrings of entries in $\mathcal{D}$
is quadratic in the size (number of symbols) of $\mathcal{D}$. In the next section we shall
describe an index structure that has the same functionality and needs storage
space linear in the size (number of symbols) of the lexicon $\mathcal{D}$.

Remark 4.13 The approach that we proposed in this section is closely related
to the algorithm of Myers [Mye94] for approximate search in strings. The main
difference is that we have a fixed threshold $b$ for the number of errors, whereas in
[Mye94] the threshold is given as a percentage of the number of symbols. This
leads to different ways of handling the arising situation and to modifications in
the application of the pigeonhole principle. For query words of length at
least $b + 1$ we always have an initialization with an exact match, whereas
in Myers' setting this assumption is not necessarily fulfilled and therefore cannot
be used.

5 Symmetric compact directed acyclic word graphs 

In this section we describe the bidirectional index structure for search in the
lexicon. Afterwards we explain how the index structure supports the computation
of solution sets described in Section 4.2 during online steps. As before,
$\mathcal{D}$ denotes the given lexicon, $\#\mathcal{D}\$$ denotes the variant where the new symbols
$\#$ and $\$$ are attached as the first and the last symbol to each lexicon word,
$\Sigma_{\#\$} = \Sigma \cup \{\#, \$\}$.

5.1 The index structure 

In Section 2.1 we described how suffix tries and suffix tries for reversed words
of the lexicon can be merged into a bidirectional (quadratic) index structure,
using a bijective correspondence between the states of the two substructures. We
now introduce a bidirectional index of linear size, which is used in our method
for approximate search. Furthermore we present a new algorithm for online
construction of such an index. To build the index we crucially use an algorithm
from [IHS+05] for online construction of one-directional compact directed acyclic
word graphs.

Definition 5.1 Let $W \in \Sigma_{\#\$}^*$, let $1 \le i \le |W|$. A word $V \in \Sigma_{\#\$}^*$ of length
$1 \le |V| \le |W|$ is said to start at position $i$ in $W$ if $i + |V| - 1 \le |W|$ and
$W_i W_{i+1} \cdots W_{i+|V|-1} = V$. Similarly $V$ is said to end at position $i$ in $W$ if
$1 \le i - |V| + 1$ and $W_{i-|V|+1} W_{i-|V|+2} \cdots W_i = V$. We define the functions
$startpos_W$ and $endpos_W$ as

$startpos_W(V) := \{i \in \mathbb{N} \mid V \text{ starts in } W \text{ at position } i\}$,
$endpos_W(V) := \{i \in \mathbb{N} \mid V \text{ ends in } W \text{ at position } i\}$.

In addition, let $startpos_W(\varepsilon) := endpos_W(\varepsilon) := \{0, 1, \ldots, |W|\}$.
We define the equivalence relations $\overrightarrow{\equiv}_W$ and $\overleftarrow{\equiv}_W$ on $\Sigma_{\#\$}^*$ as:

$X \overrightarrow{\equiv}_W Y \iff startpos_W(X) = startpos_W(Y)$,
$X \overleftarrow{\equiv}_W Y \iff endpos_W(X) = endpos_W(Y)$.
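The two position functions are easy to state directly in code. The following sketch uses 1-based positions as in Definition 5.1; the function names simply mirror the notation, and the quadratic scan is an illustrative choice rather than anything used by the index.

```python
def startpos(w: str, v: str) -> set[int]:
    """1-based positions at which v starts in w; all of 0..|w| for the empty word."""
    if not v:
        return set(range(len(w) + 1))
    return {i + 1 for i in range(len(w) - len(v) + 1) if w.startswith(v, i)}


def endpos(w: str, v: str) -> set[int]:
    """1-based positions at which v ends in w; all of 0..|w| for the empty word."""
    if not v:
        return set(range(len(w) + 1))
    return {i + len(v) - 1 for i in startpos(w, v)}
```

In the word #real$ the strings e and ea have the same start positions, while a and ea have the same end positions, so each pair is equivalent with respect to the corresponding relation for this word.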

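Definition 5.2 The equivalence relations $\overrightarrow{\equiv}_{\#\mathcal{D}\$}$ and $\overleftarrow{\equiv}_{\#\mathcal{D}\$}$ are defined on
$Subs(\#\mathcal{D}\$)$ as follows. For every $X, Y \in Subs(\#\mathcal{D}\$)$

$X \overrightarrow{\equiv}_{\#\mathcal{D}\$} Y \iff \forall W \in \#\mathcal{D}\$ : X \overrightarrow{\equiv}_W Y$,

$X \overleftarrow{\equiv}_{\#\mathcal{D}\$} Y \iff \forall W \in \#\mathcal{D}\$ : X \overleftarrow{\equiv}_W Y$.

In what follows, the equivalence class of a substring $V \in Subs(\#\mathcal{D}\$)$ w.r.t.
$\overrightarrow{\equiv}_{\#\mathcal{D}\$}$ ($\overleftarrow{\equiv}_{\#\mathcal{D}\$}$) is written $\overrightarrow{[V]}$ ($\overleftarrow{[V]}$). It is easy to prove the following properties
of the function $startpos_W$ ($endpos_W$).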

Proposition 5.3 Let $W, X, Y \in \Sigma_{\#\$}^*$ be arbitrary strings.

1. If $startpos_W(X) \cap startpos_W(Y) \ne \emptyset$ ($endpos_W(X) \cap endpos_W(Y) \ne \emptyset$),
then $X$ is a prefix (suffix) of $Y$ or vice versa,

2. If $Y$ is a prefix (suffix) of $X$, then $startpos_W(X) \subseteq startpos_W(Y)$ ($endpos_W(X) \subseteq
endpos_W(Y)$).

Proposition 5.3 can be used to show that for any two elements $X, Y \in \overrightarrow{[V]}$
($X, Y \in \overleftarrow{[V]}$) either $X$ is a prefix (suffix) of $Y$ or vice versa. Consequently we
can define the canonical representative $\overrightarrow{X}$ of $\overrightarrow{[X]}$ ($\overleftarrow{X}$ of $\overleftarrow{[X]}$) as the longest word
in $\overrightarrow{[X]}$ (in $\overleftarrow{[X]}$).
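The canonical representatives can be computed by brute force on a small lexicon: a substring is extended to the right (left) as long as every one of its occurrences is followed (preceded) by the same letter. The sketch below uses the lexicon of Example 5.17; the names `occurrences`, `extend_right`, `extend_left` and `canonical` are hypothetical, and `canonical` combines both extensions in the way Definition 5.5 does.

```python
LEXICON = ["#ear$", "#lead$", "#real$"]  # lexicon of Example 5.17


def occurrences(x: str) -> list[tuple[int, int]]:
    """All (word index, 0-based offset) pairs at which x occurs in the lexicon."""
    return [(k, i) for k, w in enumerate(LEXICON)
            for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def extend_right(x: str) -> str:
    """Longest word with the same start positions as x (append forced letters)."""
    while True:
        following = {LEXICON[k][i + len(x):i + len(x) + 1] for k, i in occurrences(x)}
        if len(following) != 1 or "" in following:
            return x
        x += following.pop()


def extend_left(x: str) -> str:
    """Longest word with the same end positions as x (prepend forced letters)."""
    while True:
        preceding = {LEXICON[k][i - 1:i] if i > 0 else "" for k, i in occurrences(x)}
        if len(preceding) != 1 or "" in preceding:
            return x
        x = preceding.pop() + x


def canonical(x: str) -> str:
    """alpha + x + beta, where extend_left(x) = alpha + x and extend_right(x) = x + beta."""
    return extend_left(x) + extend_right(x)[len(x):]
```

For this lexicon every occurrence of e is followed by a and every occurrence of a is preceded by e, so both letters are extended to ea, whereas l occurs in two different contexts and cannot be extended.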

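Proposition 5.4 [IHS+01] For every $X \in Subs(\#\mathcal{D}\$)$ there uniquely exist
$\alpha, \beta \in \Sigma_{\#\$}^*$ such that $\overleftarrow{X} = \alpha X$ and $\overrightarrow{X} = X\beta$.

Definition 5.5 For every $X \in Subs(\#\mathcal{D}\$)$ we define $\overleftrightarrow{X} := \alpha X \beta$ where $\overleftarrow{X} =
\alpha X$ and $\overrightarrow{X} = X\beta$. The equivalence relation $\overleftrightarrow{\equiv}_{\#\mathcal{D}\$}$ is defined on $Subs(\#\mathcal{D}\$)$ as
follows. For every $X, Y \in Subs(\#\mathcal{D}\$)$

$X \overleftrightarrow{\equiv}_{\#\mathcal{D}\$} Y \iff \overleftrightarrow{X} = \overleftrightarrow{Y}$.

In what follows the equivalence class of $X$ w.r.t. $\overleftrightarrow{\equiv}_{\#\mathcal{D}\$}$ is written $[X]$. We shall
use $\overrightarrow{\equiv}$, $\overleftarrow{\equiv}$ and $\overleftrightarrow{\equiv}$ as shorthands for $\overrightarrow{\equiv}_{\#\mathcal{D}\$}$, $\overleftarrow{\equiv}_{\#\mathcal{D}\$}$ and $\overleftrightarrow{\equiv}_{\#\mathcal{D}\$}$ respectively.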
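Proposition 5.6 [BBH+87] The equivalence relation $\overleftrightarrow{\equiv}$ is the transitive closure
of $\overrightarrow{\equiv}$ and $\overleftarrow{\equiv}$.

Proposition 5.7 The equivalence relation $\overleftarrow{\equiv}$ (equality of end positions) is
right-invariant: for arbitrary substrings $X, Y \in Subs(\#\mathcal{D}\$)$ and arbitrary extensions
of the form $X \circ U, Y \circ U \in Subs(\#\mathcal{D}\$)$, $X \overleftarrow{\equiv} Y$ always implies
$X \circ U \overleftarrow{\equiv} Y \circ U$. The equivalence relation $\overrightarrow{\equiv}$ (equality of start positions) is
left-invariant.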

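Definition 5.8 [BBH+87] The directed acyclic word graph (DAWG) for $\#\mathcal{D}\$$
is the deterministic finite-state automaton $\overrightarrow{A}(\#\mathcal{D}\$) := (Q_{\overrightarrow{A}}, [\varepsilon], \delta_{\overrightarrow{A}}, F_{\overrightarrow{A}})$
where

• $Q_{\overrightarrow{A}}$ is the set of all equivalence classes $[V]$ w.r.t. the right-invariant
relation of Proposition 5.7,

• the start state is $[\varepsilon] \in Q_{\overrightarrow{A}}$,

• the (partial) transition function $\delta_{\overrightarrow{A}}$ is defined as $\delta_{\overrightarrow{A}}([V], \sigma) = [V \circ \sigma]$ for
all substrings $V \circ \sigma$ of $\#\mathcal{D}\$$,

• the set of final states is $F_{\overrightarrow{A}} := \{[V] \mid [V] \cap \#\mathcal{D}\$ \ne \emptyset\}$.

Note that the right-invariance of the relation implies that $\overrightarrow{A}$ is well-defined. The DAWG
for $\#\mathcal{D}\$$ can be used (i) to check if a string $V$ is in $Subs(\#\mathcal{D}\$)$ and in the positive
case (ii) to check in constant time if a right extension $V \circ \sigma$ again represents
such a substring: for solving problem (i) we start a traversal of $\overrightarrow{A}(\#\mathcal{D}\$)$ from
$[\varepsilon]$ with the letters of $V$. Then $V \in Subs(\#\mathcal{D}\$)$ iff all transitions are defined.
In the positive case the traversal leads to the state $[V]$. To solve problem (ii)
we check if $\delta_{\overrightarrow{A}}([V], \sigma)$ is defined. Note that for tasks (i) and (ii) we need not fix
a set of final states. With the above definition of final states we may check if a
substring of the form $\#V\$$ represents a full entry of the lexicon. This holds iff
$\delta_{\overrightarrow{A}}^*([\varepsilon], \#V\$)$ is (defined and) final. Analogously, using the left-invariant relation, we define $\overleftarrow{A}(\#\mathcal{D}\$)$,
which can be used to check for left extensions $\sigma \circ V$. The question is how to
merge $\overrightarrow{A}$ and $\overleftarrow{A}$ into one bidirectional index.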

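Definition 5.9 [BBH+87] The compact directed acyclic word graph (CDAWG)
for $\#\mathcal{D}\$$ is the generalized deterministic finite-state automaton $\overrightarrow{C}(\#\mathcal{D}\$) :=
(Q_{\overrightarrow{C}}, \Sigma_{\#\$}, [\varepsilon], \delta_{\overrightarrow{C}}, F_{\overrightarrow{C}})$ where

• $Q_{\overrightarrow{C}}$ is the set of all equivalence classes $[V]$ w.r.t. $\overleftrightarrow{\equiv}$,

• the start state is $[\varepsilon] \in Q_{\overrightarrow{C}}$,

• the (partial) transition function $\delta_{\overrightarrow{C}} : Q_{\overrightarrow{C}} \times \Sigma_{\#\$}^+ \to Q_{\overrightarrow{C}}$ is defined for a
string of the form $\sigma U$ ($\sigma \in \Sigma_{\#\$}$, $U \in \Sigma_{\#\$}^*$) iff $\overrightarrow{\overleftrightarrow{V} \circ \sigma} = \overleftrightarrow{V} \circ \sigma U$. The
value is $\delta_{\overrightarrow{C}}([V], \sigma U) = [\overleftrightarrow{V} \circ \sigma]$,

• the set of final states is $F_{\overrightarrow{C}} := \{[V] \mid [V] \cap \#\mathcal{D}\$ \ne \emptyset\}$.

$\overrightarrow{C}(\#\mathcal{D}\$)$ can be considered as a compacted variant of $\overrightarrow{A}(\#\mathcal{D}\$)$, because $\overrightarrow{C}(\#\mathcal{D}\$)$
can be obtained from $\overrightarrow{A}(\#\mathcal{D}\$)$ by replacing chains of the type $q_0 \xrightarrow{\sigma_1} q_1 \xrightarrow{\sigma_2}
q_2 \cdots \xrightarrow{\sigma_n} q_n$ with multi-letter transitions $q_0 \xrightarrow{\sigma_1 \ldots \sigma_n} q_n$ iff the states $q_i$ for
$1 \le i < n$ are implicit and $q_0$ and $q_n$ are explicit [BBH+87]. Analogously
we define $\overleftarrow{C}(\#\mathcal{D}\$)$ with (partial) transition function $\delta_{\overleftarrow{C}}$:

$\delta_{\overleftarrow{C}}([V], \sigma U) = [\sigma \circ \overleftrightarrow{V}] \iff \overleftarrow{\sigma \circ \overleftrightarrow{V}} = U^{rev} \circ \sigma \circ \overleftrightarrow{V}$.

$\overleftarrow{C}(\#\mathcal{D}\$)$ can be considered as well as a compacted variant of $\overleftarrow{A}(\#\mathcal{D}\$)$ in the
above sense. Note that both automata have the same set of states. Hence
$\overrightarrow{C}(\#\mathcal{D}\$)$ and $\overleftarrow{C}(\#\mathcal{D}\$)$ are naturally merged into one bidirectional index. The
following index is used in our method for approximate search.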

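Definition 5.10 [IHS+01] The bidirectional symmetric compact directed acyclic word
graph (SCDAWG) for $\#\mathcal{D}\$$ is

$\overleftrightarrow{C}(\#\mathcal{D}\$) := (Q_{\overrightarrow{C}}, \Sigma_{\#\$}, [\varepsilon], \delta_{\overrightarrow{C}}, \delta_{\overleftarrow{C}}, F_{\overrightarrow{C}})$.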

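Linear description of SCDAWGs. Our next goal is to show how to represent
$\overleftrightarrow{C}(\#\mathcal{D}\$)$ in space linear in the size of the lexicon.

Proposition 5.11 The following inequalities hold for the number of states $|Q_{\overrightarrow{C}}|$,
the number of transitions $|\delta_{\overrightarrow{C}}|$ in $\delta_{\overrightarrow{C}}$, and the number of transitions $|\delta_{\overleftarrow{C}}|$ in
$\delta_{\overleftarrow{C}}$, in the SCDAWG $\overleftrightarrow{C}(\#\mathcal{D}\$)$:

$|Q_{\overrightarrow{C}}| \le 2\|\#\mathcal{D}\|$,
$\max(|\delta_{\overrightarrow{C}}|, |\delta_{\overleftarrow{C}}|) \le 2\|\#\mathcal{D}\| - 1$.

Proof. First we give upper bounds for the size of the suffix tree for $\#\mathcal{D}\$$,
defined as the generalized deterministic finite-state automaton $STree(\#\mathcal{D}\$) :=
(Q, \Sigma_{\#\$}, \varepsilon, \delta, F)$ where $Q := \{\overrightarrow{X} \mid X \in Subs(\#\mathcal{D}\$)\}$, $F := Suf(\#\mathcal{D}\$)$ and for
$\sigma \in \Sigma_{\#\$}$ and $U \in \Sigma_{\#\$}^*$, $\delta(\overrightarrow{X}, \sigma U) = \overrightarrow{\overrightarrow{X} \circ \sigma}$ iff $\overrightarrow{X} \circ \sigma U \in Q$ and $\overrightarrow{\overrightarrow{X} \circ \sigma} = \overrightarrow{X} \circ \sigma U$
[IHS+01]. The suffix tree represents a tree with root $\varepsilon$ and leaves $F$. For the
number of the leaves we have $|F| \le 1 + \|\#\mathcal{D}\|$. Each internal node of the
suffix tree has at least two successors. Hence $|Q| \le 2\|\#\mathcal{D}\|$. For the number of
transitions in the suffix tree we have $|\delta| \le 2\|\#\mathcal{D}\| - 1$. For every $X \in Subs(\#\mathcal{D}\$)$
it can be shown that $\overleftrightarrow{X} \in Q$ and for every transition $[X] \xrightarrow{\sigma U} [Y]$ in $\delta_{\overrightarrow{C}}$ it can
be shown that there is a suffix tree transition $\overleftrightarrow{X} \xrightarrow{\sigma U} \overleftrightarrow{X} \circ \sigma U$ in $\delta$. Consequently
$|Q_{\overrightarrow{C}}| \le |Q|$ and $|\delta_{\overrightarrow{C}}| \le |\delta|$. Analogously we obtain that $|\delta_{\overleftarrow{C}}|$ is bounded by the
number of transitions in $STree(\$\mathcal{D}^{rev}\#)$. □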



2 A state q is called implicit iff q is not the start state, q is not final and q has exactly one 
outgoing transition. A state is called explicit iff it is not implicit. 
Remark 5.12 Note that Proposition 5.11 is not sufficient to prove that the
size of the SCDAWG is $O(\|\#\mathcal{D}\$\|)$, since the labels of the transitions are strings
in $\Sigma_{\#\$}^+$. To achieve a linear description of the SCDAWG the transitions are
represented as follows. Let $D$ be a concatenation of all strings in $\#\mathcal{D}\$$. For every
state $q = [V]$ we store a position $end(q) = i$ in $D$ where $\overleftrightarrow{V}$ terminates,
$\overleftrightarrow{V} = D_{i-|\overleftrightarrow{V}|+1} D_{i-|\overleftrightarrow{V}|+2} \cdots D_i$. For every transition $t = [X] \xrightarrow{\alpha} [Y]$ in $\delta_{\overrightarrow{C}}$ let
$start(t) := end([Y]) - |\alpha| + 1$. Since $\overleftrightarrow{X} \circ \alpha$ is a suffix of $\overleftrightarrow{Y}$, for every transition $t$
in $\delta_{\overrightarrow{C}}$ we store only $start(t)$, but not the whole label $\alpha$. In an analogous way we
define $start(q)$ and $end(t)$, and for every transition $t$ in $\delta_{\overleftarrow{C}}$ we store only $end(t)$.
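The position-based representation can be illustrated with a toy sketch. It assumes that the concatenation lists the words of Example 5.17 in the order given there, that positions are 1-based, and that the solid transition from the state of ea to the state of #ear$ carries the label r$; the helper names `start_of_transition` and `label` are hypothetical.

```python
D = "#ear$#lead$#real$"  # assumed concatenation of the lexicon of Example 5.17


def start_of_transition(end_of_target: int, label_length: int) -> int:
    """start(t) = end([Y]) - |alpha| + 1 for a transition into state [Y]."""
    return end_of_target - label_length + 1


def label(start: int, end_of_target: int) -> str:
    """Recover the label from the stored start and the end position of the target."""
    return D[start - 1:end_of_target]


end_ear = 5                                   # #ear$ occupies positions 1..5 of D
t_start = start_of_transition(end_ear, 2)     # label r$ has length 2
```

Only the integer `t_start` needs to be stored with the transition; the label itself is read from the concatenation on demand.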

Online construction of SCDAWGs in linear time. In [IHS+01] Inenaga
et al. present an online algorithm that builds letter by letter the SCDAWG
$\overleftrightarrow{C}(\{\#W\$\})$ for a single string $\#W\$$ in time $O(|\#W\$|)$. Here we present a
new straightforward online algorithm that builds a representation of $\overleftrightarrow{C}(\#\mathcal{D}\$)$
string by string in time $O(\|\#\mathcal{D}\$\|)$. Our result is essentially based on another
algorithm by Inenaga et al. [IHS+05] that constructs $\overrightarrow{C}(\#\mathcal{D}\$)$ letter by letter
in an online manner. The idea is to synchronize $\overrightarrow{C}(\#\mathcal{D}\$)$ and $\overrightarrow{C}(\$\mathcal{D}^{rev}\#)$ while
simultaneously building both of them word by word.

Proposition 5.13 $\overleftarrow{C}(\#\mathcal{D}\$)$ and $\overrightarrow{C}(\$\mathcal{D}^{rev}\#)$ are isomorphic.

Proof. Let $\overleftarrow{C}(\#\mathcal{D}\$) = (Q, [\varepsilon], \delta_{\overleftarrow{C}}, F)$ and $\overrightarrow{C}(\$\mathcal{D}^{rev}\#) = (Q', [\varepsilon], \delta', F')$.
The isomorphism is given by the bijection $b : Q \to Q'$ defined as follows:

$b([X]) := [X^{rev}]$.

□

Since $\overrightarrow{C}(\#\mathcal{D}\$)$ and $\overleftarrow{C}(\#\mathcal{D}\$)$ have one and the same set of states, the bijection
$b$, defined in the above proof, provides the way to express $\delta_{\overleftarrow{C}}$ as follows:

$\delta_{\overleftarrow{C}}(q, \alpha) = b^{-1}(\delta'(b(q), \alpha))$.

In our online construction we compute $\delta_{\overrightarrow{C}}$, $\delta'$ and all values of $b$ and $b^{-1}$ for
every state $q$. Let us note that if we directly compute $b([X])$ for a given state
$[X] \in Q$ by reversing some $Y \in [X]$ and traversing $\overrightarrow{C}(\$\mathcal{D}^{rev}\#)$ with $Y^{rev}$ from
the initial state, the total time for the whole construction would be in the worst
case quadratic w.r.t. $\|\#\mathcal{D}\$\|$. To achieve linear time we need to compute $b([X])$,
given a state $[X]$, in amortized time $O(1)$. We show that such an efficient online
computation of $b$ can be based on the suffix links provided for every state $[X]$
during the construction of $\overrightarrow{C}(\#\mathcal{D}\$)$.

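Definition 5.14 The suffix link $sl([X])$ of a state $[X]$ in $\overrightarrow{C}(\#\mathcal{D}\$)$ is $[Y]$ where
$\overleftrightarrow{Y}$ is the longest suffix of $\overleftrightarrow{X}$ such that $\overleftrightarrow{Y} \notin [X]$. $sl([\varepsilon])$ is not defined.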
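Proposition 5.15 Let

$\overrightarrow{C}(\#\mathcal{D}\$) = (Q, \Sigma_{\#\$}, [\varepsilon], \delta_{\overrightarrow{C}}, F)$,
$\overrightarrow{C}(\$\mathcal{D}^{rev}\#) = (Q', \Sigma_{\#\$}, [\varepsilon], \delta', F')$,

$[X] \in Q$, $sl([X]) = [Y]$ and $b([Y]) = q'$. Let $j = |\overleftrightarrow{X}| - |\overleftrightarrow{Y}|$ and $\sigma = \overleftrightarrow{X}_j$. Then
there exists a $\sigma$-transition from $q'$. Let $\delta'(q', \sigma U) = p'$ be the $\sigma$-transition from
$q'$. Then $b([X]) = p'$.

Proof. The $\sigma$-transition from $q' = [\overleftrightarrow{Y}^{rev}]$ is defined, since $\overleftrightarrow{Y}^{rev}$ is a prefix of $\overleftrightarrow{X}^{rev}$,
which implies that there is a path with label $\overleftrightarrow{X}_j \overleftrightarrow{X}_{j-1} \cdots \overleftrightarrow{X}_1$ in $\overrightarrow{C}(\$\mathcal{D}^{rev}\#)$
from $q'$ to $[\overleftrightarrow{X}^{rev}]$. If we assume that this path is not composed of one single
transition, then for the last intermediate state $[Z^{rev}]$ of this path we have that
$Z$ is a suffix of $\overleftrightarrow{X}$, $Z$ is longer than $\overleftrightarrow{Y}$ and $Z \notin [X]$, which contradicts
$sl([X]) = [Y]$. □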

The algorithm of Figure 3 calculates $\overrightarrow{C}(\#\mathcal{D}\$)$, $\overrightarrow{C}(\$\mathcal{D}^{rev}\#)$, $b$ and $b^{-1}$, given the
lexicon $\#\mathcal{D}\$$. The states of these two CDAWGs are consecutive integers starting
from 0, which is the initial state. The function AddStringInCDAWG($\overrightarrow{C}$, $\#W\$$)
represents the online construction of CDAWGs invented by Inenaga et al. [IHS+05].
AddStringInCDAWG adds the string $\#W\$$ to $\overrightarrow{C}$. AddStringInCDAWG
changes its first argument $\overrightarrow{C}$ by adding new consecutive states in $Q$ and $F$ and
by setting the transition function $\delta_{\overrightarrow{C}}$ and the suffix links $sl$ for every new state.
AddStringInCDAWG never changes $\overleftrightarrow{X}$ for any state $[X]$ that is already in
$Q$. Hence the computation of $b$ is stable in the sense that once $b(q)$ is
computed for a given state $q$, further changes of $b(q)$ are impossible. Based on
Proposition 5.15 the function FindState recursively calculates the values of $b$.
In line 5 of FindState we use $end(i)$, the concatenation $D$ of the strings accumulated
so far described in Remark 5.12 and the length of state $s$ defined as
the length of the longest member of the equivalence class represented by $s$. The
lengths of the states are computed by AddStringInCDAWG. The bottom of
the recursion is guaranteed by $b(0) = 0$ and the decreasing lengths of the input
states provided in recursive calls. The number of times FindState is invoked is
$O(|Q|)$. Since the time for the online construction of a CDAWG is $O(\|\#\mathcal{D}\$\|)$ we
obtain the following.

Proposition 5.16 The online algorithm in Figure 3 runs in time $O(\|\#\mathcal{D}\$\|)$.

The SCDAWG for $\#\mathcal{D}\$$ can be considered as a compact version of the bidirectional
suffix trie $BiSTrie(\#\mathcal{D}\$)$ introduced in Section 2.1.

Example 5.17 The SCDAWG for the example lexicon
$\#\mathcal{D}\$ = \{\#ear\$, \#lead\$, \#real\$\}$
BuildSCDAWG( #D$ ){
 1   Q = {0}; δ = ∅; F = ∅; Q' = {0}; δ' = ∅; F' = ∅;
 2   C = (Q, 0, δ, F); C' = (Q', 0, δ', F'); b = ∅; b^-1 = ∅;
 3   b(0) = 0; b^-1(0) = 0;
 4   for( #W$ ∈ #D$ ){
 5     n = |Q|;
 6     AddStringInCDAWG(C, #W$);
 7     AddStringInCDAWG(C', $W^rev#);
 8     i = n;
 9     while( i < |Q| ){
10       b(i) = nil; i = i + 1;
11     }
12     i = n;
13     while( i < |Q| ){
14       if( b(i) == nil ){
15         b(i) = FindState(C, C', b, b^-1, i); b^-1(b(i)) = i;
16       }
17       i = i + 1;
18     }
19   }
20   return (C, C', b, b^-1);
}

FindState( C, C', b, b^-1, i ){
 1   s = sl(i);
 2   if( b(s) == nil ){
 3     b(s) = FindState(C, C', b, b^-1, s); b^-1(b(s)) = s;
 4   }
 5   q' = b(s); j = end(i) - length(s); σ = D_j;
 6   let δ'(q', σU) = p' be the σ-transition from q' in C'
 7   return p';
}

Figure 3: Online construction of a representation of the SCDAWG for $\#\mathcal{D}\$$ as
$(\overrightarrow{C}(\#\mathcal{D}\$), \overrightarrow{C}(\$\mathcal{D}^{rev}\#), b, b^{-1})$
Figure 4: SCDAWG for {#ear$, #lead$, #real$}.

is shown in Figure 4. The dashed transitions represent $\delta_{\overleftarrow{C}}$, while the solid transitions
represent $\delta_{\overrightarrow{C}}$. The equivalence classes are 0 - {ε}, 1 - {l}, 2 - {#}, 3 - {$}, 4
- {a, ea}, 5 - {r}, 6 - {d$, ad$, lead$, #lead$}, 7 - {l$, al$, eal$, real$, #real$}
and 8 - {r$, ar$, ear$, #ear$}.

5.2 Bidirectional online search using SCDAWGs

We now describe how the above index structure is used for the computation of
solution sets defined in Section 4.2. We assume that the following offline resources
are available:

1. the SCDAWG $\overleftrightarrow{C}(\#\mathcal{D}\$)$, in particular the two transition functions $\delta_{\overrightarrow{C}}$ and
$\delta_{\overleftarrow{C}}$;

2. $start([V])$ and $end([V])$ for every state $[V]$, Remark 5.12;

3. $start(t)$ for every transition $t$ in $\delta_{\overrightarrow{C}}$, Remark 5.12;

4. $end(t)$ for every transition $t$ in $\delta_{\overleftarrow{C}}$, Remark 5.12.

We keep track of the following online information - here $W$ denotes the
substring of a lexicon word faced at a certain point of the computation of solution
sets and $D$ denotes the concatenation used in the linear representation of the
SCDAWG $\overleftrightarrow{C}(\#\mathcal{D}\$)$, Remark 5.12:

1. the length of $W$;

2. (the number of) the state $[W]$;

3. the unique (Proposition 5.4) position $j_W$ of $W$ in $D$ such that $start([W]) \le
j_W \le end([W])$.

Let $\sigma \in \Sigma_{\#\$}$. We consider possible extensions of the current substring $W$ to the
right of the form $W\sigma$ as follows. If $j_W + |W| \le end([W])$, then $W\sigma \in Subs(\#\mathcal{D}\$)$
iff $\sigma = D_{j_W + |W|}$, and if $\sigma = D_{j_W + |W|}$, then $[W\sigma] = [W]$ and $j_{W\sigma} = j_W$. If
$j_W + |W| > end([W])$, then $W\sigma \in Subs(\#\mathcal{D}\$)$ iff there exists a $\sigma$-transition
from $[W]$ in $\delta_{\overrightarrow{C}}$. Let $t = [W] \to [V]$ be the $\sigma$-transition from $[W]$ in $\delta_{\overrightarrow{C}}$. Then
$[W\sigma] = [V]$ and $j_{W\sigma} = start(t) - |W|$. Possible extensions to the left of the
form $\sigma W$ are handled similarly by using $start([W])$ and $end(t)$.

Example 5.18 One example for the use of $\overleftrightarrow{C}(\#\mathcal{D}\$)$ in Figure 4 is the following.
We first want to check if e is a substring in $Subs(\#\mathcal{D}\$)$. For this
aim we start from state 0 and follow the e-transition $t = 0 \to 4$ in $\delta_{\overrightarrow{C}}$,
$j_e = start(t) = start(4) = end(4) - 1$, the number of the state [e] is 4.

• We now want to find all left extensions with a single letter. Since $j_e =
start(4)$, we have to use the three possible dashed transitions from state
4. With # we reach 8, the number of the state [#e] is 8, $j_{\#e} = start(8) =
end(8) - 4$. With r we reach 7, the number of the state [re] is 7, $j_{re} =
start(7) + 1 = end(7) - 4$. With l we reach 6, the number of the state
[le] is 6, $j_{le} = start(6) + 1 = end(6) - 4$.

• We now want to find all right extensions of e with a single letter. Since
$j_e + 1 = end(4)$, the only possible extension to the right is with letter
a, the number of the state [ea] is 4, $j_{ea} = j_e = start(4) = end(4) - 1$. If we
want to further extend ea to the right, we have to use the solid transitions,
because $j_{ea} + 2 > end(4)$.

Remark 5.19 In our actual implementation we use a simple optimization of
the approximate search based on additional information stored in the SCDAWG.
The idea is to use "positional" information to recognize blind paths of the search.
Consider a substring $V$ of a lexicon word. If some lexicon word $W$ has the
form $W = RVS$ we say that $|R|$ (resp. $|S|$) is the length of a possible prefix
(suffix) for $V$. In the SCDAWG we store for each substring $V$ the maximal and
minimal length of a possible prefix (suffix) for $V$. When computing solution
sets $Sol_{\mathcal{D}}(P', b')$, substrings $P_0$ of the pattern $P$ are aligned with substrings $V$
found in the SCDAWG. Each substring $P_0$ defines a unique prefix $R_P$ and a
unique suffix $S_P$ of the pattern $P = R_P P_0 S_P$. We check if the lengths of $R_P$ and
$S_P$ are "compatible" with the information stored in the SCDAWG for the length
of possible prefixes and suffixes for $V$. To test "compatibility", the error bound
and the distance between $P_0$ and $V$ have to be taken into account. Compatibility
is checked each time we reach a new state of the SCDAWG. We omit the technical
details.

6 Evaluation 

In this section we compare our new method to two other methods for efficient
approximate search in lexica, Oflazer's approach [Ofl96] and the forward-backward
method introduced in [MS04]. To have a common basis for the experiments we
always use as a filter mechanism Ukkonen's optimized matrix method [Ukk85].
For the three methods we present experimental results for approximate search
in lexica of different sizes and types. We also look at the dependency of search
times on the notion of similarity used. In order to get a picture of the principal
limitations of approximate search we also present evaluation results where we
simulate the "ideal method".

The "ideal method" for bound $b \in \mathbb{N}$ and dictionary $\mathcal{D}$ is based on a perfect
index $I(\mathcal{D}, b)$ that directly maps every query $(P, b)$ to the solution $Sol_{\mathcal{D}}(P, b) \cap
\mathcal{D}$. Since the size of the perfect index $I(\mathcal{D}, b)$ would be too large, for every
experiment we build a restricted perfect index $I(T, \mathcal{D}, b)$ that works only for a
small finite test set $T \subset \Sigma^*$ of query strings. For every query string $P \in T$ the
restricted perfect index $I(T, \mathcal{D}, b)$ maps $(P, b)$ to the solution $Sol_{\mathcal{D}}(P, b) \cap \mathcal{D}$.
We represent the restricted perfect index $I(T, \mathcal{D}, b)$ as an acyclic $k$-subsequential
transducer for $k = \max_{P \in T} |Sol_{\mathcal{D}}(P, b) \cap \mathcal{D}|$. An online algorithm for building
minimal acyclic $k$-subsequential transducers is introduced in [MJV101]. This form
of representation is optimal since the only time used is the time for reading the
input and directly producing the desired output.
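The role of the restricted perfect index can be sketched with an ordinary hash map that stores the precomputed answer for every test query. This replaces the subsequential transducer of the paper by a Python dictionary purely for illustration; the names `lev` and `build_restricted_index` and the tiny lexicon are assumptions.

```python
def lev(a: str, b: str) -> int:
    """Standard Levenshtein distance with unit weights."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_restricted_index(test_set, lexicon, b):
    """Map every query string of the test set to its precomputed answer list."""
    return {p: sorted(w for w in lexicon if lev(p, w) <= b) for p in test_set}


index = build_restricted_index(["dread", "eat"], ["ear", "lead", "real"], 2)
k = max(len(answers) for answers in index.values())   # maximal number of outputs
```

At query time the sketch performs a single lookup such as `index["dread"]`, which mirrors the idea that the ideal method only reads the input and emits the stored output.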

6.1 Comparison of search times for different methods 

For our first series of experiments we chose a lexicon $\mathcal{D}$ of 1,200,070 book titles.
The average length of titles is 47.64. The number of different symbols in the
alphabet of the lexicon is 99. We compare search times obtained for Oflazer's
method [Ofl96], the forward-backward method [MS04], the new method and
the "ideal method". In all experiments we set the weight of each nonidentity
edit operation $op$ to $w(op) = 1$. We then vary the distance bound $b$ from 2
to 15. For each bound $b$ we generated a test set $T$ of 10,000 query strings.
Each query string $P$ was obtained from a randomly chosen string $W \in \mathcal{D}$ by
randomly applying $b$ operations from the set of edit operations $Op$ to $W$ such
that $|P| > 36$.
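A test set of this kind could be generated along the following lines. The sketch is an assumption about the procedure, not the authors' generator: the operation set is restricted to insertion, deletion and substitution, a substitution may pick the same symbol again (so the true distance is at most b), and queries that are too short are simply redrawn. All names are hypothetical.

```python
import random


def random_edit(s: str, alphabet: str, rng: random.Random) -> str:
    """Apply one randomly chosen edit operation at a random position."""
    ops = ["insert"] + (["delete", "substitute"] if s else [])
    op = rng.choice(ops)
    if op == "insert":
        i = rng.randrange(len(s) + 1)
        return s[:i] + rng.choice(alphabet) + s[i:]
    i = rng.randrange(len(s))
    if op == "delete":
        return s[:i] + s[i + 1:]
    return s[:i] + rng.choice(alphabet) + s[i + 1:]


def make_query(lexicon, b: int, alphabet: str, min_len: int, rng: random.Random) -> str:
    """Pick a random entry, apply b random edits, retry until |P| > min_len.

    Note: this loops forever if no entry can yield a long enough query.
    """
    while True:
        p = rng.choice(lexicon)
        for _ in range(b):
            p = random_edit(p, alphabet, rng)
        if len(p) > min_len:
            return p
```

A call such as `make_query(titles, 3, "abc", 36, random.Random(0))` returns one query string; repeating it 10,000 times per bound would give a test set of the size used above.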

All experiments were run on a machine with 64 gigabytes of RAM, two 2.4
GHz Quad-Core Intel Xeon 8-core processors, 256 KB L2 cache memory per
core and 12 MB L3 cache memory per processor. Our implementation uses only

3 Universal Levenshtein automata [SM02, MS04] are more efficient but can only be built
for small distance bounds because of huge memory requirements.
Levenshtein distance, lexicon of book titles

  b    new / ideal    fb / ideal    f / ideal    ideal (ms)
  2       13.87        325.083       3817.47       0.004
  3       22.80        686.037      20063.82       0.004
  4       50.72       3904.30                      0.004
  5       54.24       7741.55                      0.005
  6       76.36                                    0.005
  7       86.55                                    0.005
  8      173.98                                    0.006
  9      154.32                                    0.006
 10      172.49                                    0.006
 11      163.06                                    0.007
 12      287.94                                    0.007
 13      261.13                                    0.007
 14      301.03                                    0.008
 15      300.05                                    0.008

Table 1: Comparison of search times for four different methods, standard Levenshtein
distance, dictionary of titles, f - Oflazer's method, fb - the forward-backward
method, new - the new method, ideal - the "ideal method". Empty
cells mean that we did not wait for termination. Explicit search times (in ms)
are only given for the ideal method (last column). All other entries represent
factors, comparing the given method with the ideal method.

one thread. The amount of memory needed for our experiments is determined
by the size of the precomputed index.

Table 1 presents results obtained for the standard Levenshtein distance.
Column 1 specifies the value of the distance bound $b$ used in the experiments.
Explicit search times are only presented for the ideal method (column 5, times
in milliseconds). A number $x$ in Table 1 for some method $M$ means that the ideal
method was $x$ times faster than method $M$ for the problem class. For example,
the entry 13.87 found in row/column 2 indicates that approximate search
using the new method presented above with distance bound 2 and standard
Levenshtein distance on average took 13.87 times the time needed by the ideal
method. Here, as in all experiments, the time needed to write the output words
is always included. Empty cells found in the table mean that we did not wait
for the respective method to finish.
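Since the table reports factors, absolute search times have to be recovered by multiplication with the last column. The following lines do this for two rows of Table 1; the dictionary names are hypothetical, and because the ideal times are rounded to three decimals the results are only approximate.

```python
IDEAL_MS = {2: 0.004, 3: 0.004}       # ideal method, last column of Table 1
NEW_FACTOR = {2: 13.87, 3: 22.80}     # column "new / ideal" of Table 1


def absolute_ms(factor: float, ideal_ms: float) -> float:
    """Approximate average search time in milliseconds for a reported factor."""
    return factor * ideal_ms


new_ms = {b: absolute_ms(NEW_FACTOR[b], IDEAL_MS[b]) for b in NEW_FACTOR}
```

For b = 2 this gives roughly 0.055 ms per query for the new method, and roughly 0.091 ms for b = 3.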

The results in Table 1 show that the method presented in this paper comes
"close" to the ideal method for small distance bounds when using the standard
Levenshtein distance. For the given lexicon of titles, which contains long strings,
the new method is dramatically faster than the forward-backward method,
which in turn is much faster than Oflazer's method. It is worth noting that the
differences become more and more drastic when using larger distance bounds.
For these bounds only the new method leads to acceptable search times.

4 We use depth-first implementations of the evaluated methods, see Remark 4.11.
Levenshtein distance, lexicon of MEDLINE sentences

  b    new / ideal    fb / ideal    f / ideal
  2        8.64          71.62        664.91
  3       10.72         145.60       2670.98
  4       13.55         727.22
  5       16.57        1357.87
  6       24.54
  7       27.23
  8       36.47
  9       40.68
 10       59.48
 11       62.36
 12      123.31
 13      123.84
 14      145.98
 15      146.11
 20      411.71
 30     1552.28
 40     4872.49
 50    23869.91

Table 2: Comparison of search times for different methods, standard Levenshtein
distance, dictionary of MEDLINE sentences, f - Oflazer's method, fb - the
forward-backward method, new - the new method, ideal - the "ideal method".
All entries represent factors, comparing the given method with the ideal method.
Empty cells mean that we did not wait for termination.



6.2 Comparison of search times for language databases 
with sentences 

For our second series of experiments we use a collection of sentences from the
life sciences and biomedical domain. The lexicon consists of all sentences from
43,000 paragraphs which were randomly chosen from MEDLINE abstracts.
The number of sentences in our list is 351,008. The average number of symbols
per sentence is 149.26. The size of the lexicon is approximately the same as
the size of the lexicon of titles, but the strings are longer. Table 2 presents the
comparison of the different methods for the standard Levenshtein distance. As
a new challenge, the distance bound $b$ used for approximate search varies from
2 to 50. Note that for previous methods the use of larger distance bounds leads
to unacceptable search times. Speed-up factors are similar to those observed in
Table 1.

6.3 Comparison of search times for different variants of 
Levenshtein distance 

For our third series of experiments we compare search times obtained for three
notions of similarity: (i) the standard Levenshtein distance, (ii) the variant where
transpositions of adjacent symbols are treated as additional edit operations,

5 MEDLINE is a bibliographic database of the U.S. National Library of Medicine.
MEDLINE contains over 19 million references to journal articles in life sciences,
see www.nlm.nih.gov/pubs/factsheets/medline.html.

Levenshtein distance with transpositions, lexicon of book titles

  b   new / ideal    fb / ideal     f / ideal
  2         19.69        346.28       4081.83
  3         36.00       1022.59      20536.91
  4         93.36       4767.08
  5        108.54      11861.07
  6        149.84
  7        167.16
  8        315.19
  9        271.12
 10        324.32
 11        310.59
 12        467.27
 13        459.23
 14        500.79
 15        496.84

Table 3: Comparison of search times for the variant of Levenshtein distance where
transpositions of adjacent symbols are treated as edit operations, f - Oflazer's
method, fb - the forward-backward method, new - the new method, ideal - the 
"ideal method". All entries represent factors, comparing the given method with 
the ideal method. Empty cells mean that we did not wait for termination. 

and (iii) the variant where also merges and splits are used as additional edit
operations. Table 3 (resp. Table 4) presents results obtained for the variant
of Levenshtein distance where we also use transpositions (resp. merges and
splits) as operations.
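
To make the three notions of similarity concrete, the following minimal Python
sketch computes them with a textbook dynamic programming matrix. It is not the
search method evaluated here. The function name `edit_distance` and its flags
are hypothetical; we assume that every operation has weight 1, that
transpositions are used in the restricted sense (an adjacent pair is swapped and
not edited further), and that a merge (split) replaces two arbitrary symbols by
one arbitrary symbol (one symbol by two).

```python
def edit_distance(p, w, transpositions=False, merges_splits=False):
    """Distance between pattern p and word w; every operation has weight 1."""
    m, n = len(p), len(w)
    # d[i][j] holds the distance between the prefixes p[:i] and w[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i symbols of the pattern prefix
    for j in range(n + 1):
        d[0][j] = j  # insert all j symbols of the word prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                d[i - 1][j] + 1,                           # deletion
                d[i][j - 1] + 1,                           # insertion
                d[i - 1][j - 1] + (p[i - 1] != w[j - 1]),  # match or substitution
            )
            if (transpositions and i >= 2 and j >= 2
                    and p[i - 1] == w[j - 2] and p[i - 2] == w[j - 1]):
                best = min(best, d[i - 2][j - 2] + 1)      # adjacent transposition
            if merges_splits:
                if i >= 2:
                    best = min(best, d[i - 2][j - 1] + 1)  # merge: two symbols -> one
                if j >= 2:
                    best = min(best, d[i - 1][j - 2] + 1)  # split: one symbol -> two
            d[i][j] = best
    return d[m][n]
```

For example, `edit_distance("abcd", "abdc")` is 2 under the standard distance
and 1 once transpositions are allowed; `edit_distance("rn", "m")` drops from 2
to 1 when merges and splits are allowed.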

Basically, the results in Tables 3 and 4 are similar to the previous results:
also for the modified distances, the new method is much faster than previous
methods, with a speed-up factor of at least 15. For large distance bounds the
speed-up factor is larger.
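
Since every table entry is a factor relative to the ideal method, the speed-up
of the new method over the forward-backward method is the quotient of the two
entries in a row. The following sketch recomputes these quotients from the rows
of Tables 3 and 4 for which both entries are available (b = 2, ..., 5). The
names `TABLE_3`, `TABLE_4` and `speedups` are ours and chosen for illustration
only.

```python
# (new / ideal, fb / ideal) for b = 2..5, copied from Tables 3 and 4.
TABLE_3 = {2: (19.69, 346.28), 3: (36.00, 1022.59),
           4: (93.36, 4767.08), 5: (108.54, 11861.07)}
TABLE_4 = {2: (51.69, 2297.07), 3: (155.55, 7199.24),
           4: (693.20, 42141.64), 5: (788.27, 112676.78)}


def speedups(table):
    """Speed-up of the new method over the forward-backward method per bound b."""
    return {b: fb / new for b, (new, fb) in table.items()}


if __name__ == "__main__":
    for name, table in (("Table 3", TABLE_3), ("Table 4", TABLE_4)):
        for b, factor in speedups(table).items():
            print(f"{name}, b = {b}: speed-up {factor:.1f}")
```

The smallest quotient is about 17.6 (Table 3, b = 2), which is in line with the
stated factor of at least 15, and the quotients grow with b in both tables.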

6.4 Comparison of search times for different symbol
distributions

In our fourth series of experiments we ask how the statistical properties of the
distribution of letters in the lexicon words influence search times. We generated
two random dictionaries of 1,200,070 strings: one with uniform distribution of
99 symbols and average string length 54.35, and another one with binomial
distribution of 99 symbols and average string length 54.63. In Table 5 the columns
"Lev+Binomial" and "Lev+Uniform" present the behavior of the algorithms for
the two random dictionaries with binomial and uniform distributions of the
symbols.

The differences between the search times for the three methods for approximate
search observed in Table 5 basically remain unchanged. For bound b = 2,
search in the natural language lexicon of titles (Table 1) is faster than search
in the lexica with binomial distribution and is slower than search in the lexica
with uniform distribution. For larger distance bounds, the differences between
the search times for the three types of lexica are more difficult to interpret.

Levenshtein distance with merges and splits, lexicon of book titles

  b   new / ideal    fb / ideal     f / ideal
  2         51.69       2297.07      37428.91
  3        155.55       7199.24
  4        693.20      42141.64
  5        788.27     112676.78
  6       1091.35
  7       1229.88
  8       2267.31
  9       1868.38
 10       2462.66
 11       2151.49
 12       3352.05
 13       3077.23
 14       3350.26
 15       3210.70

Table 4: Comparison of search times for the variant of Levenshtein distance 
where merges of two symbols into one and splits of a symbol into two symbols 
are treated as edit operations, f - Oflazer's method, fb - the forward-backward
method, new - the new method, ideal - the "ideal method". All entries represent 
factors, comparing the given method with the ideal method. Empty cells mean 
that we did not wait for termination. 



                  Lev+Binomial                              Lev+Uniform
  b   new / ideal    fb / ideal     f / ideal   new / ideal    fb / ideal     f / ideal
  2         18.12        435.29      12513.00         10.68       1686.54      74164.46
  3         20.12        861.60     103772.38         12.14       3392.04     262620.28
  4         22.04      11936.27                       14.59      73079.20
  5         25.30      23053.83                       17.33     136324.69
  6         38.04                                     25.22
  7         41.01                                     27.06
  8         53.56                                     33.69
  9         60.33                                     39.79
 10         81.73                                     56.57
 11         89.99                                     61.09
 12        156.64                                    114.70
 13        164.48                                    124.96
 14        188.93                                    137.76
 15        188.52                                    136.48

Table 5: Comparison of search times for different symbol distributions. Randomly
generated lexica. f - Oflazer's method, fb - the forward-backward method,
new - the new method, ideal - the "ideal method". All entries represent factors, 
comparing the given method with the ideal method. Empty cells mean that we 
did not wait for termination. 

Levenshtein distance, lexicon of Bulgarian word forms

  b   new / ideal    fb / ideal     f / ideal
  2         85.86        207.10       1980.55
  3        105.28        510.04       9236.49
  4        420.32       1905.09

Table 6: Search times for Bulgarian lexicon with short strings showing the influ- 
ence of the length of lexicon entries. Standard Levenshtein distance, search times 
for three methods, f - Oflazer's method, fb - the forward-backward method, new 
- the new method, ideal - the "ideal method". 



method               new         fb          f
Titles           1104.63     110.32      54.18
MEDLINE           921.41     100.02      49.87
Bg word forms      61.02       9.22       4.17

Table 7: Dictionary of titles and dictionary of MEDLINE sentences vs. smaller 
dictionary, sizes of indexes in megabytes. 

6.5 Influence of the length of the strings in the lexicon 

In our last experiment we look at the influence of the length of the strings in
the lexicon. We selected a smaller dictionary of natural language expressions
consisting of approximately 450,000 Bulgarian word forms with average word
length 10.01. In Table 6 the corresponding search times for the smaller
dictionary are found in columns 2-4. Since for every query string P we require
|P| > 3b, for bounds b > 4 there are fewer than 10,000 entries in the smaller
dictionary from which we could generate queries. For this reason in the case
of the smaller dictionary we do not present results for b > 4. Even for the
short strings of the Bulgarian lexicon, the new method is much faster than the
forward-backward method and Oflazer's method. The speed-up gained is less
drastic than for the lexica of titles and MEDLINE sentences, and here the
"ideal method" remains more than 85 times faster than the new method.

Size of index structures. Table 7 lists, for every method except the
ideal one, the sizes in megabytes of the indexes compiled from the dictionary
of titles, the dictionary of MEDLINE sentences and the dictionary of Bulgarian
word forms.

7 Historical remarks, possible applications and 
conclusion 

We introduced a new method for fast approximate search in lexica that can
be used for a large family of string distances. The method uses a bidirectional
index structure for the lexicon. This index structure can be seen as a part of
a longer development of related index structures starting with work on suffix
tries, suffix trees, and directed acyclic word graphs (DAWGs) [Wei73, McC76,
Ukk95, CSM, BBH+85]. These index structures address single texts and are
one-directional in the sense that search for substrings of the given string/text
follows the left-to-right reading order. In [Sto95, Sto00, Maa00, IHS+01] it
has been shown how to obtain bidirectional index structures for strings/texts,
supporting search for substrings using both left-to-right and right-to-left reading
order. One-directional index structures for sets of strings (as opposed to single
strings) have been described in [BBH+87, Gus97, Bre98, MMW09, IHS+05].
In each case the challenge is to find an index structure with size linear in the
size of the input text or lexicon, with a linear-time construction algorithm. In
[BBH+87] a bidirectional index structure for sets of texts is briefly sketched,
asking for natural applications. In this paper we have seen that such an index
applied to lexica can be used to realize a very fast method for approximate
search.

With the new index, the "wall effect" mentioned in the Introduction can
be avoided. Among related techniques, the BLASTA method [AGM+90] is
worth mentioning. In this approach, the occurrences of specific substrings in
the lexicon are indexed in order to reduce the number of lexicon words to be
considered. It assumes that each answer of the query has to contain at least one
of the keyed substrings, which allows it to start with an exact match of such a
promising substring. In this way BLASTA prunes the initial exhaustive search and
proves to be efficient. However, since there is no guarantee that all answers of
the query meet this condition, it may fail to retrieve the complete list of words
satisfying the query.

Our evaluation results show that the new method is much faster than previ- 
ous methods, and for lexica with long strings the speed-up is drastic. Here the 
new method for distance bound b = 2 comes close to the theoretical limit when
using the standard Levenshtein distance. 

We add a brief comment on possible applications. The
method may be used to speed up traditional spelling correction techniques. For
high quality spelling correction, speed is not the only issue. Current approaches 
typically use probabilistic techniques at two places. First, good similarity mea- 
sures for selecting candidates are based on special edit operations with weights 
depending on the particular symbols/strings used. How to find appropriate edit 
operations and weights is a question beyond the scope of the current paper. 
However, the framework of a generalized distance we use to model similari- 
ties should be general enough to cover most interesting cases. Second, when 
looking for an optimal correction suggestion for a misspelled token, language 
models (e.g., weighted word trigrams) help to find a correction suggestion that 
fits the local context. Still, similarity search in the background lexicon only 
looks at single tokens and for efficiency reasons, "context sensitive" correction 
suggestions for distinct tokens are often computed in isolation. An interesting 
question is if better results are obtained when using larger contexts already for 
the background lexica and similarity search. This strategy would guarantee that 
the correction suggestions obtained for a sequence of tokens always fit together. 
The method introduced above offers new possibilities for testing such a strategy 
since we can use long strings and large distance bounds. Of course, issues of
smoothing have to be taken into account when trying to synchronize contextual
similarity search and language models.

Possible application areas of the new method are not restricted to traditional 
fields of approximate search such as spell checking, text and OCR correction. 
Since the method is fast enough to deal with collections of long strings and 
large distance bounds, it seems promising to test its use, e.g., for detecting 
plagiarism, for finding similar sentences in translation memories and related 
language databases, and for approximate search in collections of address or 
bibliographic data. We currently also look at a variant of the method for fast 
approximate search of patterns in an indexed collection of texts. In order to 
find all approximate matches for a string in a (collection of) texts, the index 
has to be enriched by adding information on the positions of all occurrences of 
each infix. The challenge is to keep the size of the index linear in the size of the 
text(s). 

A remaining open question is the time complexity of the presented algorithm. 
A desirable approach would be to estimate the average complexity in a way 
similar to Myers' [Mye94].

There are two obvious ways in which the search times presented above could be
improved immediately. First, for small distance bounds we could use universal
Levenshtein automata [MS04] as filters. This leads to a performance gain as
compared to matrix based filters [MMS11]. Second, an additional speed-up
could be obtained by running subsearches of distinct branches of the search tree
in parallel. The optimal selection of search trees is an interesting point for
further investigations.
Appendix

We show how the search strategy described in Section 2 can be adapted to
the case of an arbitrary generalized distance $d = (Op, w)$. In what follows,
$\omega_{\max}$ denotes the maximal width of an operation $op \in Op$. To simplify the
following description, we introduce the notion of a (left, right) reduct of a word.
Intuitively, reducts of a word $U$ are obtained by deleting a "short" (possibly
empty) prefix and/or suffix of length $< \omega_{\max}$ from $U$.

Definition 7.1 Let $U \in \Sigma^*$ be represented in the form $U = U_1 \circ U_2$. If
$|U_1| < \omega_{\max}$, then $U_2$ is called a left reduct of $U$. If $|U_2| < \omega_{\max}$, then $U_1$
is called a right reduct of $U$. If $U = U_1 \circ V \circ U_2$ and both $|U_1| < \omega_{\max}$ and
$|U_2| < \omega_{\max}$, then $V$ is called a reduct of $U$.

We note that for $\omega_{\max} = 1$ the word $U$ is always the only reduct of $U$. The
formal background for the adapted search procedure is provided by the following
generalization of Proposition 4.7.
Proposition 7.2 Let $P' = P'_1 \circ P'_2$ and let $\alpha$ be an alignment with $l(\alpha) = P'$
and $w(\alpha) \le b'$. Then:

1. $\alpha$ can be represented in the form $\alpha = \alpha_1 \circ \beta \circ \alpha_2$ such that
$\beta \in Op \cup \{\varepsilon\}$, $l(\alpha_1)$ is a right reduct of $P'_1$, and $l(\alpha_2)$ is a left reduct of $P'_2$.

2. For each such decomposition and integers $b'_1$ and $b'_2$ with $b'_1 + b'_2 = b' - 1$
it holds that $w(\alpha_1) \le b'_1$ or $w(\alpha_2) \le b'_2$.

Proof. We first prove Part 1. Let $\alpha_1$ denote the maximal prefix of $\alpha$ with the
property that $l(\alpha_1)$ is a prefix of $P'_1$. If $l(\alpha_1) = P'_1$ we define $\beta := \varepsilon$. Otherwise
there exists an operation $op \in Op$ such that $l(\alpha_1)$ is a proper prefix of $P'_1$, the
latter being a proper prefix of $l(\alpha_1 \circ op)$. In this case we define $\beta := op$. In
both cases $\alpha_2$ is now determined by the equation $\alpha = \alpha_1 \circ \beta \circ \alpha_2$. It is trivial
to check that this representation has the properties stated above. The second
statement follows easily. □
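
The second statement is a pigeonhole argument. The following lines spell it out
under two assumptions that are ours and are not restated in this appendix: all
weights are non-negative integers, so that $w(\alpha) = w(\alpha_1) + w(\beta) + w(\alpha_2)$
with $w(\beta) \ge 0$, and the bound on the alignment reads $w(\alpha) \le b'$.

```latex
\begin{align*}
&\text{Suppose } w(\alpha_1) > b'_1 \text{ and } w(\alpha_2) > b'_2,
  \text{ i.e. } w(\alpha_1) \ge b'_1 + 1 \text{ and } w(\alpha_2) \ge b'_2 + 1. \\
&\text{Then } w(\alpha) \;\ge\; w(\alpha_1) + w(\alpha_2)
  \;\ge\; b'_1 + b'_2 + 2 \;=\; (b' - 1) + 2 \;=\; b' + 1 \;>\; b', \\
&\text{contradicting } w(\alpha) \le b'. \text{ Hence }
  w(\alpha_1) \le b'_1 \text{ or } w(\alpha_2) \le b'_2.
\end{align*}
```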

Recall that in the special situation considered in Section 4 we decomposed
the pattern $P$ into subparts $P = P_1 \circ P_2 \circ \cdots \circ P_{b+1}$, and for substrings of the form
$P_k \circ \cdots \circ P_l$ ($k \le l$, possible combinations of $k, l$ determined by the structure of
the search tree) we computed approximate matches with substrings of lexicon
words using distinct bounds. In the general situation considered here we split
$P$ as above. We then try to find approximate matches between reducts of the
substrings $P_k \circ \cdots \circ P_l$ and substrings of lexicon words. For a formal description,
let us introduce another notational convention. By $r(i, U, j)$ we denote the
reduct obtained from $U$ by deleting the unique prefix and suffix of length $i$ and
$j$, respectively. Hence $r(0, U, 0) = U$.
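
As an illustration of this notation, the following Python sketch implements
r(i, U, j) by slicing and enumerates all reducts of a word for a given maximal
operation width. The function names `reduct` and `all_reducts` and the
parameter `w_max` (standing for the maximal width of an operation) are
hypothetical and not part of the original description.

```python
def reduct(i, u, j):
    """r(i, U, j): delete the prefix of length i and the suffix of length j."""
    if i < 0 or j < 0 or i + j > len(u):
        raise ValueError("prefix and suffix must fit into the word")
    return u[i:len(u) - j]


def all_reducts(u, w_max):
    """Set of all reducts r(i, U, j) with i < w_max and j < w_max."""
    result = set()
    for i in range(w_max):
        for j in range(w_max):
            if i + j <= len(u):  # skip combinations longer than the word
                result.add(reduct(i, u, j))
    return result
```

With `w_max = 1` the only reduct of "abcde" is the word itself; with
`w_max = 2` the reducts are "abcde", "bcde", "abcd" and "bcd".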

Building the generalized search tree for a pattern. For a given input
pattern $P$ and a bound $b$, let $T_P$ denote the search tree defined in Section 2.
With each query $(P', b')$ decorating a node $\eta$ we associate as a subcase analysis
the set of all derived queries of the form $(r(i, P', j), b')$ where $i, j < \omega_{\max}$. The
problem considered at node $\eta$ is to solve all derived queries of the above form.
Note that $(r(0, P, 0), b)$ is equivalent to $(P, b)$.

Computation of solution sets for derived queries. For each derived
query $(r(i, P', j), b')$ of the generalized tree we compute a set $S_{\mathcal{D}}(r(i, P', j), b')$
in a bottom-up fashion. We shall prove below that $S_{\mathcal{D}}(r(i, P', j), b')$ is the
solution set $Sol_{\mathcal{D}}(r(i, P', j), b')$ in each case.

Initialization steps. For a derived query $(r(i, P', j), 0)$ at a leaf we
decide if $r(i, P', j)$ is a substring of a lexicon word. In the positive case we let
$S_{\mathcal{D}}(r(i, P', j), 0) := \{r(i, P', j)\}$, otherwise we define $S_{\mathcal{D}}(r(i, P', j), 0) := \emptyset$.
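
A naive rendering of the initialization step is sketched below. It assumes a
small lexicon given as a Python list and tests substring containment by brute
force, whereas the method described in this paper answers the same question
with the bidirectional index. The names `is_substring_of_lexicon`,
`leaf_solution_sets`, `lexicon` and `w_max` are hypothetical.

```python
def is_substring_of_lexicon(s, lexicon):
    """Brute-force test whether s occurs inside some lexicon word."""
    return any(s in word for word in lexicon)


def leaf_solution_sets(part, lexicon, w_max):
    """Map (i, j) to the set computed for the derived query (r(i, part, j), 0)."""
    sets = {}
    for i in range(w_max):
        for j in range(w_max):
            if i + j > len(part):
                continue  # no reduct exists for this combination
            r = part[i:len(part) - j]  # the reduct r(i, part, j)
            # Bound 0: the reduct itself is the only possible answer.
            sets[(i, j)] = {r} if is_substring_of_lexicon(r, lexicon) else set()
    return sets
```

For the toy lexicon `["approximate", "search"]`, the pattern part "prox" and
`w_max = 2`, all four derived queries succeed and yield "prox", "pro", "rox"
and "ro", respectively.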

Extension steps. Let $(r(i, P', j), b')$ denote a derived query at a non-leaf
node $\eta$ of $T_P$, and let $(P'_1, b'_1)$ and $(P'_2, b'_2)$ denote the main queries of the two
children $\eta_1, \eta_2$ of $\eta$, which are given in the natural left-to-right ordering. Given
all sets $S_{\mathcal{D}}(r(i, P'_1, j_1), b'_1)$ and $S_{\mathcal{D}}(r(i_2, P'_2, j), b'_2)$ for the derived queries at
$\eta_1, \eta_2$ we define $S_{\mathcal{D}}(r(i, P', j), b')$ as the union of the two sets $S_1$ and $S_2$ defined
as

$$S_1 = \bigcup_{j_1 = 0}^{\omega_{\max} - 1} \{\, U \circ V \in Subs(\mathcal{D}) \mid U \in S_{\mathcal{D}}(r(i, P'_1, j_1), b'_1),\ d'_1 \le b' \,\},$$

$$S_2 = \bigcup_{i_2 = 0}^{\omega_{\max} - 1} \{\, V \circ U \in Subs(\mathcal{D}) \mid U \in S_{\mathcal{D}}(r(i_2, P'_2, j), b'_2),\ d'_2 \le b' \,\}.$$

Here $d'_1 = d(r(i, P'_1, j_1), U) + d(Q_1, V)$ where $Q_1$ is obtained from $P' = P'_1 P'_2$
by deleting the prefix of length $|P'_1| - j_1$ and the suffix of length $j$. Similarly
$d'_2 = d(r(i_2, P'_2, j), U) + d(Q_2, V)$ where $Q_2$ is obtained from $P'$ by deleting the
prefix of length $i$ and the suffix of length $|P'_2| - i_2$.

Proposition 7.3 The computation of solution sets is correct: for each derived
query $(r(i, P', j), b')$ we have $S_{\mathcal{D}}(r(i, P', j), b') = Sol_{\mathcal{D}}(r(i, P', j), b')$.

The proof is a simple modification of the earlier correctness proof. We just
use Proposition 7.2 instead of Proposition 4.7.